3D Video Coding for Embedded Devices
Energy Efficient Algorithms and Architectures
Bruno Zatt
Department of Computer Science
Karlsruhe Institute of Technology
Karlsruhe, Germany
Institute of Informatics
Federal University of Rio Grande do Sul (UFRGS)
Porto Alegre, Brazil

Muhammad Shafique
Karlsruhe Institute of Technology
Karlsruhe, Germany

Jörg Henkel
Department of Computer Science
Karlsruhe Institute of Technology
Karlsruhe, Germany

Sergio Bampi
Institute of Informatics
Federal University of Rio Grande do Sul (UFRGS)
Porto Alegre, Brazil
Acknowledgements
This work was partly supported by the German Research Foundation (DFG) as part of
the Transregional Collaborative Research Centre “Invasive Computing” (SFB/TR 89);
http://invasic.de.
This work was also partly funded by the Brazilian Coordination for the Improvement
of Higher Level Personnel (CAPES—“Coordenação de Aperfeiçoamento de Pessoal
de Nível Superior”) and the Graduate Program on Microelectronics (PGMICRO),
Institute of Informatics, Federal University of Rio Grande do Sul (UFRGS).
Contents
1 Introduction ... 1
1.1 3D-Video Applications ... 2
1.2 Requirements and Trends of 3D Multimedia ... 3
1.3 Overview on Multimedia Embedded Systems ... 5
1.4 Issues and Challenges ... 6
1.5 Monograph Contribution ... 7
1.5.1 3D-Neighborhood Correlation Analysis ... 7
1.5.2 Energy-Efficient MVC Algorithms ... 8
1.5.3 Energy-Efficient Hardware Architectures ... 9
1.6 Monograph Outline ... 9
2 Background and Related Works ... 11
2.1 2D/3D Digital Videos ... 11
2.2 Multiview Correlation Domains ... 14
2.2.1 Spatial Domain Correlation ... 14
2.2.2 Temporal Domain Correlation ... 15
2.2.3 Disparity Domain Correlation ... 16
2.3 Multiview Video Coding ... 16
2.3.1 MVC Encoding Process ... 18
2.3.2 Motion and Disparity Estimation ... 22
2.3.3 MVC Mode Decision ... 27
2.3.4 MVC Rate Control ... 28
2.4 3D-Video Systems ... 29
2.5 Multimedia Architectures Overview ... 30
2.5.1 Multimedia Processors/DSPs ... 30
2.5.2 Reconfigurable Processors for Video Processing ... 31
2.5.3 Application-Specific Integrated Circuits ... 32
2.5.4 Heterogeneous Multicore SoCs ... 33
List of Figures
Fig. 3.4 MVC energy for distinct mode decision schemes ... 56
Fig. 3.5 ME/DE energy breakdown ... 56
Fig. 3.6 MVC vs. Simulcast complexity ... 57
Fig. 3.7 MVC computation breakdown ... 58
Fig. 3.8 Memory bandwidth for 4-views MVC encoding ... 59
Fig. 3.9 Frame-level energy consumption for MVC ... 61
Fig. 3.10 Memory requirements for motion estimation at MB level ... 61
Fig. 3.11 Objective video quality in relation to coding modes ... 64
Fig. 3.12 Energy-efficient Multiview Video Coding overview ... 66
Fig. 4.1 Coding mode distribution ... 74
Fig. 4.2 Visual analysis of the coding mode correlation ... 75
Fig. 4.3 Coding mode hits in the 3D-neighborhood ... 77
Fig. 4.4 Variance PDF for different coding modes ... 78
Fig. 4.5 (a) PDF for RDCost difference (between the current and the neighboring MBs) for SKIP hit and miss; (b, c) surface plots of RDCost difference for the SKIP coding mode hit and miss; (d) RDCost prediction error for spatial neighbors ... 79
Fig. 4.6 PDF of RDCost for different prediction modes in Ballroom sequence ... 80
Fig. 4.7 Average RDCost prediction error for spatial neighbors in Vassar sequence ... 81
Fig. 4.8 MVC prediction structure and 3D-neighborhood details ... 82
Fig. 4.9 MV/DV error distribution between predictors and optimal vector (Ballroom, Vassar) ... 83
Fig. 4.10 View-level bitrate distribution (Flamenco2, QP = 32) ... 85
Fig. 4.11 Frame-level bitrate distribution for two GGOPs (Flamenco2, QP = 32) ... 86
Fig. 4.12 Basic unit-level bitrate distribution (Flamenco2, QP = 32) ... 86
Fig. 4.13 PDF showing the area of high probability as the shaded region ... 87
Fig. 4.14 PDF of RDCost for SKIP MBs ... 88
Fig. 4.15 Threshold curves for RDCost ... 89
Fig. 4.16 PDF of variance for different prediction modes ... 89
Fig. 4.17 Overview of the multilevel fast mode decision ... 91
Fig. 4.18 Early SKIP threshold curves for (a) RDCost and (b) Variance ... 92
Fig. 4.19 Evaluation of thresholds for early termination (Ballroom, QP = 32) ... 93
Fig. 4.20 Early termination threshold plots for Relax (blue) and Aggressive (red) complexity reduction ... 94
Fig. 4.21 MVC coding structure for asymmetric coding ... 96
Fig. 4.22 Energy-aware MVC complexity adaptation scheme ... 97
Fig. 4.23 Pseudo-code of mode decision for different QCCs ... 98
Fig. 5.16 Algorithm for Search Map prediction and the dynamic formation of the Search window ... 144
Fig. 5.17 Analyzing the memory requirements for ME/DE of different MBs in Ballroom sequence ... 146
Fig. 5.18 Search window memory organization with power gating ... 147
Fig. 5.19 Search Map prediction accuracy and on-chip memory misses ... 148
Fig. 5.20 ME/DE hardware architecture block diagram ... 149
Fig. 6.1 Spatial–temporal–disparity indexes for the benchmark multiview video sequences ... 154
Fig. 6.2 Time savings comparison with the state of the art ... 156
Fig. 6.3 Time savings considering the multiple QPs ... 157
Fig. 6.4 Time savings distribution summary ... 157
Fig. 6.5 Rate-distortion results for fast mode decision algorithms ... 158
Fig. 6.6 Complexity adaptation for MVC for changing battery levels ... 159
Fig. 6.7 Complexity reduction for the fast ME/DE ... 160
Fig. 6.8 Average number of SAD operations ... 161
Fig. 6.9 Fast ME/DE RD curves ... 161
Fig. 6.10 Bitrate prediction accuracy ... 162
Fig. 6.11 Accumulated bitrate along the time ... 163
Fig. 6.12 Rate-distortion results for the HRC ... 164
Fig. 6.13 ME/DE architecture with application-aware power gating physical layout ... 165
Fig. 6.14 Memory-related energy savings employing dynamic search window technique ... 166
Fig. A.1 JMVC encoder high-level diagram ... 176
Fig. A.2 Mode decision hierarchy in JMVC ... 177
Fig. A.3 Inter-frame search in JMVC ... 177
Fig. A.4 Communication in JMVC ... 178
Fig. B.1 MVC viewer main screen ... 182
Fig. B.2 Current macroblock-based analysis screenshot ... 183
Fig. B.3 Output example: four prediction directions and their respective accessed areas ... 183
Fig. B.4 Current macroblock-based analysis screenshot ... 184
Fig. B.5 Output example: reference frame access index considering two block matching algorithms: full search and TZ search ... 184
Fig. C.1 CES video analyzer user interface ... 186
Fig. C.2 CES video analyzer features ... 186
Fig. C.3 Coding mode analysis using CES video analyzer ... 187
Fig. C.4 ME/DE analysis using CES video analyzer ... 187
List of Tables
Abbreviations
3D Three-Dimensional
3DTV Three-Dimensional Television
3DV Three-Dimensional Video (future video standard)
ASIP Application-Specific Instruction-Set Processor
AVC Advanced Video Coding
BR Bitrate
BU Basic Unit
CABAC Context-Based Adaptive Binary Arithmetic Coding
CAVLC Context-Based Adaptive Variable Length Coding
CIF Common Intermediate Format
CODEC Coder/Decoder
DC Direct Current
DCT Discrete Cosine Transform
DDR Double Data Rate
DE Disparity Estimation
DF Deblocking Filter
DMV Differential Motion Vector
DPB Decoded Picture Buffer
DPM Dynamic Power Management
DSP Digital Signal Processing
DV Disparity Vector
DVS Dynamic Voltage Scaling
EPTZ Early Prediction Terminator Zone
FIR Finite Impulse Response
FPGA Field Programmable Gate Array
FPS Frames Per Second
Chapter 1
Introduction
The consumers' thirst for new and more immersive multimedia technologies, combined with the industry's interest in boosting the entertainment market, has driven the fast popularization of 3D-video content generation, 3D-capable devices, and 3D applications. Although the first 3D-video device was developed in 1833 and the first 3D film demonstration dates from 1915, the format only became known worldwide in the 1980s through IMAX technology. The real 3D-video hype, however, came in the late 2000s with the massive popularization and availability of 3D movies, followed by 3D-capable televisions dedicated to home cinema. To put this popularization in perspective, more than 10 % of the televisions sold in the USA in 2011 were 3D capable. The latest field to be affected by the 3D-video popularization is exactly the field responsible for the biggest IC (integrated circuit) industry growth since the popularization of personal computers: mobile embedded systems. Shipments of smartphones, tablets, personal camcorders, and other mobile devices have already surpassed PC shipments; for instance, more than 650 million smartphones are expected to be shipped in 2013, compared to 430 million PCs in the same year. Jointly, the popularization of 3D videos and mobile devices is leading to a scenario where a large number of 3D-capable smart devices reaches users every day, resulting in a large amount of 3D-video content being generated, encoded, stored, transmitted, and displayed. According to CISCO, video content already represents 51 % of current Internet traffic and is expected to touch the 90 % mark by 2014. It is also important to consider that mobile traffic, 0.6 Exabytes per month in 2011, is expected to reach 10.8 Exabytes per month in 2016.
To bridge the gap between 3D-video content generation and network and storage capabilities, 3D videos must be encoded efficiently, reducing the amount of data required for their representation. Multiview video coding (MVC), an extension of H.264/AVC, is the state of the art in 3D-video coding. Based on the multiple-views paradigm, like the majority of current 3D-video technology, MVC reduces the size of the 3D-video representation by 20–50 % compared to H.264/AVC simulcast. The cost of this efficiency improvement is increased coding complexity and increased energy consumption, mainly at the encoder side. The energy consumption stems from multiple processing units working in parallel to meet throughput constraints (processors, DSPs, GPUs, ASICs) and from intense memory access. In a scenario dominated by mobile devices, the increase in energy consumption collides with the battery restrictions posed by these mobile embedded systems. This conflict between coding efficiency and energy constraints brings the main challenge related to 3D-video realization on embedded systems: jointly design algorithmic and architectural energy-efficient solutions that enable real-time high-definition 3D-video coding while maintaining high video quality under severe energy constraints. The main goal of this monograph is to address this challenge by presenting novel algorithms and hardware architectures designed to show the feasibility of 3D-video encoding on embedded battery-powered devices.
In the next sections, after this introduction, an overview of the 3D-video applications that make the field so promising is presented. After that, a brief introduction to the trends in 3D-video coding and multimedia embedded systems is given, followed by the related issues and research challenges. The chapter concludes with a summary of the contributions of this work.
1.1 3D-Video Applications

The adoption of 3D videos is directly associated with the emergence of new applications that require a sensation of depth in order to improve the users' immersion experience. In the following, an overview of the main 3D-video applications is presented. These applications share the same concept of capturing multiple views of the same 3D scene. To give the illusion of depth, distinct views are displayed to each eye using displays that employ technologies based on parallax barriers, lenticular sheets, color polarization, directional polarization, or time interleaving; more details on this phenomenon are provided in Chap. 2.
• Three-dimensional video personal recording: Popularized by 3D-capable mobile devices and 3D-video sharing services, 3D-video personal recording is the most widespread 3D-video service in terms of video content availability. With a 3D-video recording device, users are free to create and publish their own video content.
• Three-dimensional television (3DTV): 3DTV extends traditional 2D television with depth perception. In this kind of application, two or more views are decoded and displayed simultaneously, and each viewer sees two views, one for the right eye and the other for the left eye. The simplest 3D displays, the stereoscopic displays, show two simultaneous views and require special glasses (polarized or active-shutter glasses) to provide the 3D sensation. The evolution of the stereoscopic display is the auto-stereoscopic display, which eliminates the need for glasses; here, parallax barriers and lenticular sheets are the most common solutions. Multiview displays are able to show a higher number of views at the same time, increasing the observer's freedom by supporting head parallax, i.e., the viewpoint changes when the observer changes position.
• Free-viewpoint television (FTV): In this application, the user is able to select the
desired viewpoint in a 3D scene. It provides realism and interactivity to the user,
i.e., the focus of attention can be controlled. The display technology used may
vary from 2D televisions to multiview displays.
• Three-dimensional telepresence: Allows the user to communicate and interact with interlocutors as if they were in the same location. Telepresence has been widely used for video teleconferencing, mainly in corporate environments, and for the implementation of so-called virtual offices. The evolution towards 3D represents a meaningful step towards improving the perception and interaction level between conference attendees.
• Three-dimensional telemedicine: Telemedicine was conceived to surpass physical limitations and make it possible for a doctor to examine patients or perform surgeries from a distinct location by using telecommunication methods. 3D-video capability brings telemedicine to a whole new level, where the specialist can precisely perceive the 3D space and proceed accurately through robotic actuators. This technology enables better health care quality in remote places that lack qualified specialists.
• Three-dimensional surveillance: Traditional video surveillance systems rely on 2D videos and pose difficulties to authorities when precise depth information is required. Employing 3D videos for surveillance provides much richer information, since it is possible to accurately extract depth and angulation data for all objects in the 3D scene. Therefore, a better description of the interaction between objects, such as possible criminals and victims, is obtained.
Among these applications, some are not designed for mobile use (e.g., 3D surveillance and 3D telemedicine) or require only decoding at the mobile device (e.g., 3DTV, FTV). For other applications, however, the capability to encode 3D videos is mandatory. For instance, 3D-video personal recording requires real-time and energy-efficient 3D-video encoding. 3D telepresence, when running on embedded devices, demands real-time, energy-efficient, and low-delay 3D-video encoding. Aware of the challenges posed by this set of applications, this work focuses on the MVC video encoder.
[Fig. 1.1 Data-volume growth for (a) single-view video at different resolutions (CIF to QHD) and frame rates (15, 30, 60 fps) and (b) multiview video with 2, 4, and 8 views]

1.2 Requirements and Trends of 3D Multimedia
Previous coding standards, for instance MPEG-2, were designed for and typically used with videos at low-to-medium resolutions and low-to-medium frame rates, such as CIF (352 × 288), VGA (640 × 480), and SDTV (768 × 576) at 15–30 fps (frames per second) (note that these numbers refer to typical use and the main target operation profiles; the standards define a very wide operation range). H.264 additionally targets high resolutions and high frame rates such as 720p (1280 × 720) and HD1080p (1920 × 1080) at 30–60 fps. The next generation of coding standards, represented by H.265/HEVC (High Efficiency Video Coding), also targets high and ultra-high resolutions and frame rates, including QHD (3840 × 2160) and UHDTV (7680 × 4320) videos at 60–120 fps. To quantify this growth, the ratio between the corner cases shown in Fig. 1.1a, CIF@15 fps and QHD@60 fps, is equivalent to a 327× factor. Also, targeting improved quality, the samples' bit depth is increasing from 8 bits up to 14 bits, requiring wider data operators. From the complexity and energy consumption perspective, the scenario is even worse, since complexity scales nonlinearly with the amount of data. The increase in resolution, for instance, leads to higher processing effort per MB, higher memory traffic, and larger on-chip memory for the Motion Estimation (ME; see Chap. 2), resulting in increased energy consumption. Moreover, the evolution of video coding standards severely contributes to the increase in complexity and energy requirements. For example, the H.264 encoder is approximately 10× more complex than the MPEG-4 encoder, while HEVC is expected to bring an additional 2–10× complexity increase relative to H.264.
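As a quick sanity check of the 327× factor quoted above, the ratio of luma samples per second between the two corner cases can be computed directly (a minimal sketch, counting luma samples only):

# Sanity check of the 327x growth factor (CIF@15 fps vs. QHD@60 fps)
cif_rate = 352 * 288 * 15      # luma samples per second, CIF at 15 fps
qhd_rate = 3840 * 2160 * 60    # luma samples per second, QHD at 60 fps
print(qhd_rate / cif_rate)     # ~327.3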
For 3D videos, the scaling scenario becomes even more dramatic, as shown in Fig. 1.1b. Besides the resolution and frame-rate increase, it is necessary to deal with the linear data growth in relation to the number of views. As MVC includes new coding tools, the complexity and energy consumption increase in a nonlinear (above-linear) fashion, as quantified in Sect. 3.1. The impact of this fast scaling of 3D multimedia requirements on embedded systems is discussed in the next section.
1.3 Overview on Multimedia Embedded Systems

The fast evolution of multimedia embedded systems has been driven by the popularization of so-called smart devices (smartphones, tablets, and other mobile devices capable of data, audio, and video communication). Meaningful progress has been made by the major players in the field in terms of performance boost and energy efficiency. This progress, however, is not enough to close the gap between multimedia application requirements and technology evolution. ARM specialists, whose processors equip about 90 % of current embedded devices, predict a performance increase in the order of 10× when comparing the state of the art in 2009 to that predicted for 2016, as shown in Fig. 1.2a. Energy restrictions related to slow battery evolution are the major factor limiting the performance of embedded systems. According to Panasonic, the capacity of Li-ion batteries has been increasing, on average, by 11 % annually since 1994, as shown in Fig. 1.2b.
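To put that growth rate in perspective, at 11 % per year battery capacity takes roughly 6.6 years to double, far slower than the growth of multimedia workloads (a quick computation):

import math
annual_growth = 0.11
print(math.log(2) / math.log(1 + annual_growth))   # ~6.6 years to double capacity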
The high performance and energy efficiency required by current 3D-video applications are not met by generic embedded solutions such as embedded processors, GPUs, and DSPs. There is a need to implement application-specific hardware accelerators that deliver the required throughput while minimizing energy consumption, at the cost of reduced flexibility. The latest high-end embedded SoCs (Systems on Chip) already implement this approach for multimedia processing,
e.g., H.264 video encoding and decoding, as detailed in Sect. 2.5. Some examples
are Qualcomm Snapdragon, Nvidia Tegra, Samsung Exynos, and Texas Instruments
OMAP. The hardware support, however, needs to be extended in order to efficiently
handle 3D videos.
Fig. 1.2 (a) Mobile systems performance trend and (b) Li-ion battery capacity growth

1.4 Issues and Challenges

In this scenario, employing hardware accelerators optimized for the specific MVC application is mandatory. Given the gap between 3D multimedia processing requirements and the embedded processing reality, there is a need to further reduce the complexity and energy consumption at the algorithmic and architectural levels. Such optimizations are only possible by employing deep application knowledge to perform a coupled and integrated optimization of the algorithms employed and the underlying hardware architecture.
In addition to varying coding settings and battery state, multimedia applications are subject to input content variations that significantly change the system behavior and requirements. For instance, videos with higher motion intensity require more processing and memory accesses, resulting in more active processing units and larger on-chip memories, ultimately leading to increased energy consumption. Such variations can only be detected at run time. Therefore, energy-efficient MVC encoding systems require algorithmic and hardware run-time adaptivity that exploits knowledge of the application and of the video content characteristics. The adaptation schemes must be able to handle the energy efficiency vs. video quality trade-off in order to find the optimal operation point for each given system state and video input.
Energy reduction algorithms and energy-oriented optimizations might lead to rate-distortion (RD) performance losses, i.e., reduced video quality for the same bitrate. To avoid or minimize this drawback, mechanisms are needed that control the losses by optimizing the bit distribution among different views, frames, and image regions.

An in-depth study of the issues and challenges related to MVC encoding is presented in Chap. 3. In the following section, the contributions are summarized.
1.5 Monograph Contribution

The goal of this monograph is to understand the run-time behavior of the MVC encoder from the energy consumption perspective and to propose algorithms and hardware architectures able to jointly meet the performance constraints and respect the energy envelope of state-of-the-art embedded devices. In this section, a summary of the contributions of this monograph is presented, highlighting the main proposed innovations. A deeper description of these contributions is found in Chap. 3, while the technical details are presented in Chaps. 4 and 5 and the results in Chap. 6.
1.5.2 Energy-Efficient MVC Algorithms

The energy-efficient algorithms for MVC concentrate on three MVC encoding blocks: mode decision (MD), motion and disparity estimation (ME/DE), and rate control (RC). The MD and ME/DE units are responsible for the dominant share of energy consumption in the MVC encoder, as discussed in Chap. 3. The proposed fast MD and ME/DE target energy reduction through complexity reduction. These algorithms interact with a novel energy-aware complexity adaptation algorithm that controls the energy consumption by changing the coding efficiency according to the battery state. The drawback of the energy-efficient algorithms is a quality drop under certain coding conditions. To minimize this negative impact, a hierarchical rate control (HRC) solution is proposed that optimizes bit utilization while maximizing and smoothing video quality in the spatial, temporal, and disparity domains.
• Multilevel mode decision-based complexity adaptation: Incorporates an Early SKIP prediction technique into a sophisticated mode decision scheme composed of six decision steps and bad-prediction protection. This fast MD employs multiple MD aggressiveness strengths (to control energy vs. quality losses), 3D-neighborhood knowledge, coding-mode ranking, video-properties-based prediction, and rate-distortion cost (RDCost) prediction. Quantization parameter (QP)-based thresholding is employed to react to changing-QP scenarios. The complexity adaptation algorithm employs asymmetric view coding to maximize the perceived video quality in the face of battery discharge and provides graceful quality degradation over time.
• Fast motion and disparity estimation: The proposed fast ME/DE widely exploits the correlation of motion and disparity vectors within the 3D-neighborhood in order to avoid the search for non-key frames in the MVC prediction structure. According to the confidence in the neighboring MBs, the algorithm selects the Fast or Ultra-Fast prediction mode.
• Hierarchical rate control: This innovative solution for the MVC rate controller employs two actuation levels, frame-level and basic unit (BU)-level rate control, with a coupled feedback loop. The frame-level RC uses a Model Predictive Controller (MPC) to estimate the bitrate of future frames and decide the best QP. A Markov decision process (MDP) with reinforcement learning (RL) and regions-of-interest (RoI) weighting is employed at the BU level to further optimize the QP selection within frames.
1.5.3 Energy-Efficient Hardware Architectures

The energy-efficient hardware architectures target motion and disparity estimation, which represents the most complex and energy-intensive coding block of the MVC encoder. A ME/DE architecture is proposed that aims to reduce the energy consumption of 4-view real-time HD1080p encoding through reduced on-chip memory and external memory accesses and through efficient dynamic power management of the processing path and memory architecture. The architectural innovations are introduced in the following and detailed in Chap. 5.
• Motion and disparity estimation hardware architecture: This monograph proposes an architectural solution for the ME/DE block of the MVC encoder. The architecture features techniques to improve performance and reduce the overall energy consumption. Our description defines the main building blocks composing the proposed architecture and the interaction between them. The hardware blocks are designed to support multiple search algorithms, throughputs, and memory hierarchies.
• Multibank on-chip video memory: This proposal enables a reduced on-chip video memory with sector-level power gating in order to reduce the energy consumption by lowering leakage current. The on-chip memory works in a cache-like fashion and employs multiple banks for high throughput. Distinct dynamic power management (DPM) techniques are proposed based on memory-usage prediction using the 3D-neighborhood information.
• Memory design methodology: A study of the memory requirements under different coding scenarios and video contents is presented to provide the basis for defining the memory size and organization. Based on this study, an offline statistical analysis is used to define the memory hierarchy, considering the on-chip memory size and the number of external memory accesses.
• Dynamic search window formation-based data reuse: Macroblocks previously encoded in the 3D-neighborhood are used to create a search map that tracks the search pattern behavior. From the search map, a prefetch scheme named Dynamic Search Window formation is employed. This technique focuses on reducing external memory accesses and the number of active memory sectors in the on-chip video memory.
• Application-aware power gating: This proposal implements a memory-requirements prediction scheme to accurately control the power states of the on-chip video memory sectors. Once again, the MBs within the 3D-neighborhood are used as the source of information for decision making.
Chapter 2
Background and Related Works
In this chapter the basic notions of digital videos, multiview video systems, and the multiview video coding (MVC) standard are presented. The mode decision, motion and disparity estimation, and rate control modules are detailed, since they are the main foci of this monograph. A detailed state-of-the-art review is presented covering 3D-video systems, multimedia architectures, and energy-efficient algorithms and architectures for video coding.
2.1 2D/3D Digital Videos

[Fig. 2.1 A frame divided into slices and 16 × 16-pel macroblocks (MBs)]
The human visual system is less sensitive to chrominance than to luminance, so the color channels are typically stored at reduced resolution (decimation). The most used color subsampling pattern is YUV 4:2:0, which stores one U and one V sample for every four luminance samples, reducing the total amount of raw video data by 50 %.
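A minimal sketch of this 50 % saving, assuming 8-bit samples (the function name and interface are illustrative only):

def raw_frame_bytes(width, height, subsampling="4:2:0"):
    """Raw frame size in bytes for 8-bit YUV video."""
    luma = width * height                 # one Y sample per pixel
    if subsampling == "4:4:4":
        return luma * 3                   # full-resolution U and V
    if subsampling == "4:2:0":
        return luma + 2 * (luma // 4)     # one U and one V per four Y samples
    raise ValueError(subsampling)

# 4:2:0 halves the raw data relative to unsubsampled 4:4:4:
print(raw_frame_bytes(1920, 1080, "4:2:0") / raw_frame_bytes(1920, 1080, "4:4:4"))  # 0.5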
All currently widespread video coding standards are based on block coding; in other words, they divide each frame into pixel blocks to encode the video. These blocks are named macroblocks (MBs). In H.264, the latest video coding standard, the MBs are blocks of 16 × 16 luma pixels and their associated chroma samples (see Fig. 2.1). A group of MBs is called a slice. A slice can be formed by one or more MBs, which may be contiguous or not, and one frame is formed by one or more slices. In turn, each slice is classified into one of three types (the SI and SP slices are not considered here): intra (I), predictive (P), and bi-predictive (B) slices. The example in Fig. 2.1 is composed of three slices: one contiguous (Slice 0) and two noncontiguous (Slices 1 and 2). Note that the terminology used here is based on the H.264 standard and is directly applicable to the MVC standard.
To better understand the different slice types, it is necessary to understand the two basic prediction modes used by state-of-the-art video encoders: intra-frame and inter-frame prediction. Intra-frame prediction exploits only the spatial redundancy by using surrounding pixels to predict the current MB. Inter-frame prediction exploits the temporal redundancy (similarity between different frames) by using areas from other frames, called reference frames, to better predict the current MB. Intra (I) macroblocks use intra-frame prediction, while predictive (P) and bi-predictive (B) macroblocks use inter-frame prediction. While P macroblocks use only past frames as reference (in coding order), B macroblocks can use reference frames from the past, the future, or a combination of both. Intra slices are formed only by I MBs, predictive (P) slices support I and P macroblocks, and bi-predictive (B) slices support I and B macroblocks.
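These slice/macroblock-type rules can be summarized as a small lookup table (a sketch; the names are illustrative):

# Allowed macroblock types per slice type, as described above
ALLOWED_MB_TYPES = {
    "I": {"I"},         # intra slices: only I macroblocks
    "P": {"I", "P"},    # predictive slices: I and P macroblocks
    "B": {"I", "B"},    # bi-predictive slices: I and B macroblocks
}

def mb_type_valid(slice_type, mb_type):
    return mb_type in ALLOWED_MB_TYPES[slice_type]

assert mb_type_valid("P", "I") and not mb_type_valid("I", "B")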
Multiview video sequences are composed of a finite number of single-view video sequences captured by independent cameras of the same 3D scene. Usually these cameras are carefully calibrated, synchronized, and positioned. They are typically aligned in a parallel 1D or 2D array; however, there are systems where the cameras are positioned in arch or cross shapes. The typical spacing between
cameras is 5 cm, 10 cm, or 20 cm for most of the available test sequences. Figure 2.2 presents a multiview video with four views and the captured frames along the time axis. From the video coding perspective, MVC, as detailed in Sect. 2.3, extends the concept of inter-frame prediction to inter-view prediction, where the correlation between different views is exploited. A deeper discussion of the spatial, temporal, and view/disparity correlations is provided in Sect. 2.2.

[Fig. 2.3 Multiview video system: multiview video processing (pre-processing, encoding, depth-map estimation, etc.), broadcast (cable, satellite TV, Internet, ...), and multiview post-processing (decoding, depth-based rendering, view synthesis, etc.) feeding a 2D/3D display, an auto-stereoscopic display, or FTV]
Figure 2.3 depicts the complete system required to capture, encode, transmit, decode, and display multiview videos. The captured sequence is encoded by an MVC encoder in order to reduce the amount of data to be transmitted. The generated bitstream may be transmitted by broadcast or over the Internet, or stored on media servers or local storage. At the decoder side the bitstream, or part of it, is decoded and displayed according to the display technology available at the receiver end. For a simple single-view display, the decoder considers only the base view, which is decodable with a regular (H.264/AVC) video decoder. For stereoscopic displays, only two views are decoded and displayed. In free-viewpoint television (FTV) systems, the user selects the desired viewpoint within the 3D scene and the video decoder selects which views to decode. For multiview displays, all displayed views must be decoded along with the reference views used to reconstruct them.
2.2 Multiview Correlation Domains

This section defines the three types of redundancy, or correlation, present in multiview video sequences in order to provide the background required for a better understanding of the MVC coding tools, detailed in Sect. 2.3, and of the 3D-neighborhood concept presented in Sect. 3.5.1. We discuss the correlation at the pixel level, i.e., the similarities used to predict the image pixels, and at the coding-information level, i.e., how neighboring blocks share coding properties such as coding modes, vectors, etc. For a more general description, we present the three correlation dimensions independently: (1) spatial correlation, (2) temporal correlation, and (3) view/disparity correlation. Single-view video coding standards are able to exploit (1) and (2), while MVC incorporates (3) to provide improved prediction for multiview videos.
2.2.1 Spatial Domain Correlation

The spatial correlation is the similarity between regions of the same frame. Previous image and video coding standards, such as JPEG2000 and H.263, were already able to exploit this similarity through MB prediction based on neighboring pixels (see Sect. 2.3). Neighboring MBs tend to belong to the same image region and share similar image properties. For this reason, the surrounding pixels are typically good block predictors for the intra-frame prediction process. Exceptions happen at object borders, where the image properties may change abruptly. Consider the example in Fig. 2.4: all the MBs in the white background share similar image properties, and the same happens for the MBs within one of the objects. The discontinuity occurs when an object border is found, leading to increased prediction error. Note that, for simplicity, the spatial correlation is referred to as one dimension, but it is actually composed of two dimensions, the width and height of a frame.

On average, the current coding standards are able to efficiently employ intra-frame prediction for pixel data. However, the correlation of coding side information (coding modes, motion vectors, disparity vectors, etc.) is only superficially exploited. In H.264, a few simple techniques exploit this kind of correlation. The differential
motion vector (DMV) coding, for instance, transmits only the difference between the current motion vector and a predictor derived from the neighboring blocks.

[Fig. 2.4 Neighborhood correlation example across time and views (frames T6S1, T7S1, T7S2), illustrating ME/MV, DE/DV, and the GDV]
2.2.2 Temporal Domain Correlation

The temporal correlation represents the similarities between different frames of the same view of a video sequence. That is, the objects of a given frame are usually present in neighboring temporal frames with a displacement that depends on their motion. Consider the frames T6S1 (view 1, time 6) and T7S1 (view 1, time 7) in Fig. 2.4: the same objects are seen in both frames with a small displacement. Thus, frame T7S1 may be accurately predicted from the reference frame T6S1. The displacement between the two frames is found using motion estimation (see Sect. 2.3.2). Besides the pixel-level prediction, the coding data are also similar for the same object along the time axis; in other words, for the same object at distinct time instants, the same set of coding modes and motion behavior tends to be employed. The correlation is lost when there is an occlusion or the object moves out of the captured scene.

Analogous to the spatial correlation, there are tools able to exploit the temporal correlation at the pixel level, i.e., motion estimation (Sect. 2.3.2). At the coding-side-information level, an attempt to exploit this correlation was made in the H.264 standard through the temporal direct prediction of motion vectors. This prediction uses the motion vector of the collocated MB (the MB sharing the same relative position in the reference frame) to predict the current one.
2.3 Multiview Video Coding

(orange and green chrominance). The MVC also supports different subsampling patterns, including 4:2:0 (four luminance samples for one sample of each chrominance channel), 4:2:2 (two luminance samples for one sample per chrominance channel), and 4:4:4 (one luminance sample for one sample of each chrominance channel). The supported color spaces/subsampling and coding tools depend on the profile of operation (JVT 2009a, b).
Originally, three profiles were defined in the H.264 standard: Baseline, Main, and Extended. The Baseline profile focuses on video calls and videoconferencing; it supports only I and P slices and the context-adaptive variable-length coding (CAVLC) entropy coding method. The Main profile was designed for high-definition displaying and video broadcasting; besides the tools defined by the Baseline profile, it also includes support for B slices, interlaced videos, and CABAC entropy coding. The Extended profile targets video streaming over channels with high packet loss and defines the SI (Switching I) and SP (Switching P) slices (Richardson 2010). In 2005, the Fidelity Range Extensions (FRExt) defined the High profiles: High, High 10, High 4:2:2, and High 4:4:4, targeting high-fidelity videos (JVT 2009a, b).

The MVC extension introduced to the standard a new set of CABAC contexts and new supplemental enhancement information (SEI) messages to simplify parallel decoding and the transmission of sequence parameters (JVT 2009a, b). Additionally, disparity estimation, or inter-view prediction, was introduced (Merkle et al. 2007). This is the most important innovation in MVC, allowing the exploitation of similarities between different views. Its function is to find the best match for the current macroblock in a reference frame within a reference view. The possible search criteria, search patterns, and objective are similar to those of motion
estimation, detailed in Sect. 2.3.2.

[Fig. 2.6 MVC encoder high-level diagram: prediction of the original current frame from temporal and disparity reference frames, producing the reconstructed current frame]

2.3.1 MVC Encoding Process
Figure 2.6 presents the high-level block diagram of the MVC encoding process. As a hybrid coding standard, it is composed of three phases: prediction, transforms, and entropy coding. The transform and entropy phases are similar to H.264/AVC, except for the new syntax elements to be encoded by the entropy encoder. The main innovation is in the prediction phase, which incorporates the inter-view prediction tool, the disparity estimation (DE).

The base view, the first one to be encoded, is encoded in compliance with the H.264 standard; its prediction has two options, intra-frame or inter-frame prediction. The other views are named dependent views and additionally employ inter-view prediction. The complete encoding process is described in this section considering the Main profile tools in the YUV color space with 4:2:0 subsampling, while further extensions available in the High profiles are omitted for simplicity.

The MVC prediction structure inherits all the possibilities for temporal references and coding orders defined by H.264. In addition, distinct view coding orders may be employed; the most used are IPP and IBP (Merkle et al. 2007). The prediction structure depicted in Fig. 2.7 employs the IBP view coding order and a hierarchical bi-prediction (HBP) structure in the temporal domain for eight views, with a group of pictures (GOP) size equal to 8. The set of GOPs for all views is referred to in MVC as a GGOP (group of groups of pictures). The frames located at the GGOP borders are called anchor frames, while all others are non-anchor frames.
The intra-frame prediction uses the neighboring pixels within the frame to predict the samples of the current MB. MVC supports two MB partition sizes for intra-frame prediction. The 4 × 4 size has nine prediction directions, as presented in Fig. 2.8, where modes 0 and 1 apply a simple copy of the neighboring samples, modes 3–8 perform a weighted interpolation according to the prediction direction, and mode 2 (DC) replicates the average of the neighboring samples over the entire block. Each of the 16 blocks inside the MB may use a different prediction direction in order to find the best prediction.

Intra-prediction can also be performed using the 16 × 16 block size; in this mode the number of prediction directions is restricted. Figure 2.9 presents the four prediction directions. Modes 0–2 are analogous to modes 0–2 of the 4 × 4 block size, while the plane mode (3) applies a linear filtering (Richardson 2010) to the neighboring samples, resulting in a gradient texture. The 4 × 4 and 16 × 16 predictions presented are used for luminance samples. The chrominance prediction uses the same four directions present in the 16 × 16 intra-prediction, and the block size depends on the color subsampling; for 4:2:0, 8 × 8 chroma blocks are used.
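To make the three simplest 4 × 4 luma modes concrete, the sketch below implements the copy-based vertical (mode 0) and horizontal (mode 1) modes and the averaging DC mode (mode 2); the directional interpolating modes 3–8 and the standard's sample-availability rules are omitted:

import numpy as np

def intra_4x4_predict(top, left, mode):
    """Predict a 4x4 block from the 4 neighbors above (top) and to the left."""
    top, left = np.asarray(top), np.asarray(left)
    if mode == 0:                                  # vertical: copy the row above
        return np.tile(top, (4, 1))
    if mode == 1:                                  # horizontal: copy the left column
        return np.tile(left.reshape(4, 1), (1, 4))
    if mode == 2:                                  # DC: replicate the rounded average
        return np.full((4, 4), (top.sum() + left.sum() + 4) >> 3)
    raise NotImplementedError("modes 3-8 interpolate along a direction")

print(intra_4x4_predict([10, 20, 30, 40], [12, 12, 12, 12], 2))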
The inter-frame prediction, or motion estimation (ME), provides another prediction possibility. Its function is to search the previously encoded past or future frames for the best matching candidate in order to provide a good prediction. The ME features bi-prediction, multiple block sizes, motion vector prediction, ¼-sample motion vector accuracy, weighted prediction, and other tools that help to improve the prediction quality (Richardson 2010), as defined in the H.264/MVC video coding standard and detailed in Sect. 2.3.2.

For the dependent views (all views except the base one), the inter-view prediction, or disparity estimation (DE), is also available (Merkle et al. 2007). This MVC
[Fig. 2.9 The four 16 × 16 intra-prediction modes: vertical, horizontal, DC (average), and plane (H+V)]
extension searches for the best matching candidate within frames belonging to previously encoded views (left, right, up, or down, depending on the camera arrangement and view prediction structure). All features of ME are supported in DE; more details about these features and how they influence encoder efficiency and complexity are discussed in Sects. 2.3.2 and 3.1.1.
The output of the prediction phase is a large set of prediction candidates. Among all the different block sizes for intra-prediction, inter-frame prediction, and inter-view prediction, the best prediction mode must be selected by the mode decision (MD) in order to provide the optimal rate-distortion (RD) trade-off (Richardson 2010). The rate is the number of bits required to encode the MB, and the distortion is the objective video quality measured as peak signal-to-noise ratio (PSNR). To obtain the optimal solution, all modes must be completely encoded, reconstructed, and evaluated according to an RD optimization equation. Therefore, the MD (represented by the selection key in Fig. 2.6) is of key importance, since it controls the quality vs. bitrate trade-off of the whole encoder.
[Fig. 2.10 Block processing order in the transform module: luma 4 × 4 blocks, followed by the 8 × 8 chroma (Cb, Cr) blocks]
depending on the syntax element being encoded. The coding method is an evolution of variable-length coding that adapts better to multiple contexts. The context-adaptive binary arithmetic coding (CABAC) is a new tool defined by the H.264/AVC standard; it implements a novel coding technique able to reduce the bitstream size by about 5–15 % (Wiegand et al. 2003) in comparison to the CAVLC encoder. The probability tables used in CABAC are updated at the bit level and present strong data dependencies. For further information please refer to (JVT 2009a, b; Richardson 2010).
After the entropy coding, the bitstream is assembled and the encoding is complete. However, every macroblock has to be reconstructed to serve as reference for further MBs. For that, the inverse quantization and inverse transforms are applied to the quantized coefficients (the same data previously sent to the entropy encoder). Once the residues are inversely quantized, they are added to the predicted block in order to reconstruct the decoded MB. The reconstruction loop guarantees the consistency between encoder and decoder sides, avoiding drift between them. To reduce the blocking artifacts (due to different prediction modes) in the reconstructed frames, the standard defines an in-loop deblocking filter (DF). The filtered MBs are used for displaying and to generate the references for inter-frame and inter-view prediction; intra-prediction uses unfiltered macroblocks inside a frame. The DF has five filtering strengths and filters the borders of each 4 × 4 block of the image following the order presented in Fig. 2.12 (Richardson 2010).
2.3.2 Motion and Disparity Estimation

Multiview video sequences are usually captured at a high sample rate, over 30 fps, to improve the motion flow and give the observer a sense of smoother motion. This high frame rate implies a high redundancy, or similarity, between neighboring frames along the time axis. As noticed in Fig. 2.13, frames S0T0 and S0T1 are very similar; hence only the differences between them have to be transmitted. The algorithm that exploits these inter-frame similarities is the motion estimation (ME). It searches the temporally neighboring frames, known as reference frames (see Fig. 2.14), for the region that represents the best match for the current block or macroblock. Once the best matching block is found, a vector pointing to that
position, the motion vector (MV) in Fig. 2.14, is generated. Consider, for example, a background region (one of the yellow boxes in Fig. 2.13): there is no motion between T0 and T1, so the motion vector m2 is probably zero. The moving dancers (woman's face in the yellow box) present a displacement along time, represented by m1. The set of motion vectors of a given frame is called the motion field and represents valuable information for understanding the motion of an object as time progresses.
The cameras that capture the different sequences of a given 3D scene are located near each other (typically about 5–20 cm apart) (Su et al. 2006); thus, many regions are shared between neighboring cameras. A very high similarity is perceived between neighboring cameras, as exemplified by frames S0T0 and S1T0 of Fig. 2.13. MVC defines the disparity estimation (DE) to exploit the redundancy between different views and avoid transmitting replicated information multiple times. The approach of the DE is similar to that of the ME: it searches for the best matching candidate block within frames of the neighboring views. The frame used for the search is called the reference frame, while the view is called the reference view, as shown in Fig. 2.14. Once the matching block is found, its position is pointed to by the so-called disparity vector (DV); see Fig. 2.14. The set of DVs in a frame is referred to as the disparity field and represents the disparity of the objects between views. While the length of a motion vector (MV) represents the speed at which an object (or the camera) is moving, the disparity vector denotes the displacement of a given object between two views. The disparity depends on the distance between cameras,
and on the distance between the camera and the object (Kauff et al. 2007). The closer the object, the larger the displacement, or disparity.

[Fig. 2.14 ME/DE search: search windows around the collocated position in a temporal reference frame (motion estimation, motion vector) and in a reference view's frame (disparity estimation, disparity vector)]

For instance, in Fig. 2.13, the
background presents almost no disparity between S0 and S1 (d2), while the dancers have a much larger disparity vector (d1). The average disparity vector between two views, considering all objects and background, is named the global disparity vector (GDV) (Kauff et al. 2007; Han and Lee 2008); see Fig. 2.7.

The ME/DE search is not performed over the complete reference frame but within a region called the search window (SW), defined by a search range (SR), as shown in Fig. 2.14; for instance, an SR of [±16, ±16] covers 33 × 33 candidate positions. Many search schemes for ME have been proposed over the last two decades, and their characteristics are well known. The exhaustive algorithm, the Full Search (FS) (Yang 2009), provides the optimal result at the cost of a very high computational
effort. Many fast algorithms focusing on complexity reduction with small quality loss exist, such as Log Search (JVT 2009a, b), Diamond Search (DS) (Kuhn 1999), Three Step Search (TSS) (Jing and Chau 2004a), UMHexagon Search (Chen et al. 2002), and Enhanced Predictive Zonal Search (EPZS) (Tourapis 2002), to list a few. These algorithms are based on multiple search steps oriented by geometric shapes. The most recent schemes also consider the neighboring MBs as predictors to define the search starting point. Using a predicted starting point is an evolution over earlier search schemes, which use the collocated MB as the starting point. Recall that the collocated MB is the macroblock in the reference frame that occupies the same relative position as the current MB.
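The sketch below illustrates the basic block-matching search with the SAD (sum of absolute differences) criterion over a search window; the same routine serves DE if the reference frame comes from a neighboring view instead of another time instant. This is a plain Full Search for illustration only; as discussed above, real encoders use fast patterns such as TZ Search:

import numpy as np

def full_search(cur_frame, ref_frame, mb_x, mb_y, mb=16, sr=16):
    """Exhaustive ME/DE for one MB; returns the best vector and its SAD."""
    cur = cur_frame[mb_y:mb_y + mb, mb_x:mb_x + mb].astype(np.int32)
    best_vec, best_sad = (0, 0), None
    for dy in range(-sr, sr + 1):                  # (2*sr+1)^2 candidates
        for dx in range(-sr, sr + 1):
            x, y = mb_x + dx, mb_y + dy
            if x < 0 or y < 0 or x + mb > ref_frame.shape[1] \
                    or y + mb > ref_frame.shape[0]:
                continue                           # candidate leaves the frame
            cand = ref_frame[y:y + mb, x:x + mb].astype(np.int32)
            sad = int(np.abs(cur - cand).sum())
            if best_sad is None or sad < best_sad:
                best_sad, best_vec = sad, (dx, dy)
    return best_vec, best_sad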
Despite the similarity between ME and DE, there are behavioral differences that make solutions designed for ME inefficient when applied to DE. For instance, most of the traditional fast ME search patterns perform badly for DE. The reason is that motion vectors are usually confined to a relatively small range, while disparity vectors are usually much longer, frequently 50–100 samples long. For this reason, the recommended search range is at least [±96, ±96] for SD resolutions (Xu and He 2008). In this scenario, most of the fast algorithms tend to fall into local minima and do not find the optimal candidate. For this reason the JMVC, the reference software for MVC (JVT 2009a, b), implements the TZ Search, which is more complex than DS and EPZS, for example, but still 23× faster than FS (Yang 2009). TZ Search employs a predictor-centered search start and a larger geometric search pattern. It performs well for both ME and DE with negligible or no quality loss in comparison to FS.
Since the conceptual tasks of ME and DE are similar, the available features are the same, and together they represent the most computation- and memory-intensive tasks in the video encoder; see the discussion in Sect. 3.1. For this reason, ME and DE have to be considered jointly when proposing smart fast algorithms and efficient architectural solutions for real-time MVC encoding.

In the following, the motion and disparity estimation features are detailed. Note that all these tools are mandatory at the decoder side, depending upon the operation profile, but are optional for the encoder.
Bi-prediction: In MVC, two types of MBs employ ME/DE: predictive (P) MBs, coded using inter-frame prediction referencing only past frames and backward views in display order, and bi-predictive (B) MBs, coded using reference frames both from past/backward and from future/forward (this is possible due to the out-of-order coding and decoding allowed by the standard). In a B macroblock, each partition can be predicted from one or two reference frames (Sullivan and Wiegand 2005). In case of bi-prediction, the final prediction is generated by averaging the predictions from past/backward and future/forward.

The reference frames are stored in two lists: List 0 and List 1. List 0 orders the frames from the past and from backward views, and List 1 orders the frames from the future and from forward views (JVT 2003). Both lists can be ordered with temporal references first or disparity references first. With temporal references first, reference index 0 in List 0 is the closest past encoded frame; with disparity references first, index 0 in List 0 is the closest backward-view reference frame. An analogous organization is observed in List 1.
Multiple block sizes: MVC allows ME/DE blocks of several sizes. The 16 × 16 MB can be segmented into two 16 × 8, two 8 × 16, or four 8 × 8 partitions (JVT 2009a, b). Each 8 × 8 partition can be further segmented into two 8 × 4, two 4 × 8, or four 4 × 4 sub-partitions. Each partition may point to one reference frame per list (List 0 and List 1), while each sub-partition may use only the frames referenced by the partition to which it belongs. Each partition or sub-partition has a single MV or DV.
Multiple reference frames and reference views: Differently from earlier standards,
in MVC the past and future reference frames are not restricted to the immediately
adjacent ones. To reconstruct a given macroblock, temporally distant frames can be
used in the prediction process, and this distance is limited only by the size of the
decoded picture buffer (DPB) (Sullivan and Wiegand 2005). The reference frames
are managed in List 0 and List 1, as previously described. Analogously, the reference
views are not restricted to the closest backward or forward views; any previously
encoded view may be used as reference depending on the coding settings.
Quarter-sample motion vector accuracy: In general, the motion of blocks does not
match the integer pixel grid of a frame exactly, so fractional-sample motion vector
accuracy is used to reach a better match. The MVC (JVT 2003) defines quarter-sample
motion compensation for the reference frame blocks. For luma samples, a six-tap FIR
filter is used to interpolate half-samples, and then a simple average of integer and
generated half-samples is used to generate the quarter-sample interpolation (JVT 2003).
When working with 4:2:0 subsampling, the chroma sample interpolation applies
1/8-sample accuracy.
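As an illustration, a minimal sketch of the horizontal case is shown below; the six-tap kernel (1, −5, 20, 20, −5, 1) with /32 rounding is the H.264/AVC luma filter on which MVC builds, while the function names are ours:

```c
#include <stdint.h>

/* Clip a value to the 8-bit sample range. */
static uint8_t clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v); }

/* Half-sample luma interpolation between horizontally adjacent integer
 * samples, using the six-tap FIR kernel (1,-5,20,20,-5,1)/32.
 * 'row' points at the integer sample to the left of the half position. */
static uint8_t interp_half_h(const uint8_t *row)
{
    int acc = row[-2] - 5 * row[-1] + 20 * row[0]
            + 20 * row[1] - 5 * row[2] + row[3];
    return clip255((acc + 16) >> 5);          /* +16 for rounding, then /32 */
}

/* Quarter-sample value: average of an integer sample and the adjacent
 * generated half-sample, as described in the text. */
static uint8_t interp_quarter(uint8_t integer_sample, uint8_t half_sample)
{
    return (uint8_t)((integer_sample + half_sample + 1) >> 1);
}
```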
Weighted prediction: MVC defines a weighted prediction in the inter-frame coding
process that applies a multiplicative weighting factor and an additive offset to each
interpolated sample of a given reference frame. For single-directional prediction
from List 0 or List 1, this tool is defined as presented in Eq. (2.1), where “x” is
replaced by the list number (0 or 1), “w” is the weighting factor, “logWD” is a
scaling factor, and “o” is the additive offset. P represents the interpolated pixels
and P′ the weighted sample. For bi-prediction, the weighted prediction is defined
as presented in Eq. (2.2).
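In the H.264/AVC weighted prediction formulation that MVC inherits (clipping of the result to the valid sample range omitted):

$$P' = \left(\left(P \cdot w_x + 2^{\,\mathrm{logWD}-1}\right) \gg \mathrm{logWD}\right) + o_x \qquad (2.1)$$

$$P' = \left(\left(P_0 w_0 + P_1 w_1 + 2^{\,\mathrm{logWD}}\right) \gg \left(\mathrm{logWD}+1\right)\right) + \left(\left(o_0 + o_1 + 1\right) \gg 1\right) \qquad (2.2)$$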
Motion/disparity vector prediction: The motion and disparity vectors are not
transmitted directly; they are differentially coded with respect to a predictor derived
from the spatial neighbor blocks' vectors. However, SKIP macroblocks and
direct-predicted macroblocks (macroblocks with no transmitted residue or motion
vectors) are processed differently, using the direct spatial or direct temporal
predictions. The motion/disparity vector prediction is one example of using the video
correlation to predict coding side information, as previously mentioned in Sect. 2.2.
The MVC provides a large number of options for macroblock prediction.
Intra-prediction defines two prediction sizes (three if FRExt is considered), 16 × 16
and 4 × 4, with four and nine prediction modes, respectively. ME evaluates multiple
candidate blocks for seven different block sizes. Additionally, the disparity estimation
adds a set of coding possibilities as large as that of motion estimation.
The mode decision (MD) module is responsible for dealing with this large
optimization space. For that, it implements an optimization algorithm and defines a
cost function called RDCost, the rate–distortion cost (a.k.a. J cost). The objective is
to evaluate the coding modes and find the one that minimizes the RDCost, obtaining
the best relation between rate and distortion. Equation (2.3) presents the J function,
where c and r represent the current original MB and the reconstructed one, MODE
is the prediction mode used, and QP is the quantization parameter. D represents the
distortion measured after the complete MB reconstruction according to a distortion
metric, and R is the number of bits used to encode the current MB; this number is
available once the entropy encoding is completed. λ is the Lagrange multiplier used
to control the rate–distortion trade-off. The Lagrange multiplier value is not defined
by the standard; however, it is typically defined by Eq. (2.4) and depends upon the
QP. To quantify the distortion, different metrics may be used; some examples are
the sum of absolute differences (SAD), the sum of absolute transformed differences
(SATD), and the sum of squared errors (SSE). The SSE is mostly used in the mode
decision step since it provides better PSNR results: PSNR is calculated from the
mean squared error (MSE), which is simply the SSE divided by the number of
samples, so the SSE is directly related to PSNR. PSNR is currently the most widely
accepted objective video quality metric. However, SAD is widely used in real-time
systems due to its lightweight computation.
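In the formulation customary in the H.264/JMVC literature:

$$J(c, r, \mathrm{MODE} \mid QP) = D(c, r, \mathrm{MODE} \mid QP) + \lambda_{\mathrm{MODE}} \cdot R(c, r, \mathrm{MODE} \mid QP) \qquad (2.3)$$

$$\lambda_{\mathrm{MODE}} = 0.85 \cdot 2^{(QP-12)/3} \qquad (2.4)$$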
Although the algorithm to find the mode that minimizes the RDCost is not defined
by the standard, the MVC reference software JMVC implements an exhaustive
search by completely encoding all possible coding modes and selecting the best one.
It is known as rate–distortion optimized mode decision (RDO-MD), also referred to
as Full RDO or Exhaustive RDO. The RDO-MD guarantees the optimal MB
encoding but drastically increases the encoder computational effort, making this
approach infeasible for real-time MVC encoding with current technology.
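Conceptually, the exhaustive loop reduces to the following sketch, where the encoding and distortion routines are hypothetical placeholders for the reference-software internals:

```c
#include <float.h>

/* Hypothetical interfaces standing in for the JMVC routines. */
typedef struct MacroblockCtx MacroblockCtx;
extern int    encode_mb_with_mode(MacroblockCtx *mb, int mode); /* returns bits R          */
extern double distortion_sse(const MacroblockCtx *mb);          /* SSE after reconstruction */
#define MODE_COUNT 32   /* all intra and inter/view modes and partitions */

/* Exhaustive RDO mode decision: fully encode the MB with every candidate
 * mode and keep the one minimizing J = D + lambda * R. */
int rdo_mode_decision(MacroblockCtx *mb, double lambda)
{
    int best_mode = 0;
    double best_j = DBL_MAX;
    for (int mode = 0; mode < MODE_COUNT; mode++) {
        int bits = encode_mb_with_mode(mb, mode);   /* full encode incl. entropy coding */
        double j = distortion_sse(mb) + lambda * bits;
        if (j < best_j) { best_j = j; best_mode = mode; }
    }
    return best_mode;
}
```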
Fig. 2.15 Rate control actuation levels: a GGOP along time covering views V0–V3 (I/P anchor frames and hierarchical B frames), the frame level, and the basic unit level composed of 16 × 16-pel macroblocks (MBx, MBx+1, …)
According to Li et al. (2003), the rate control (RC) is the block of the video encoder
that regulates the output bitstream to produce high video quality at a given target
bitrate. In the MVC scenario, an efficient RC scheme must provide increased video
quality for a given target bitrate with smooth visual quality variation over time,
across views, and within frames. Most importantly, the RC should keep the bitrate
as close as possible to the target bitrate (optimizing the bandwidth usage) while
avoiding sudden bitrate variations.
The rate control unit typically manages the quality vs. bitrate trade-off through QP
adaptation. The bitrate and/or the video distortion metric are predicted using a
prediction model. According to the prediction and the target bitrate (amount of bits
per second used to encode the video), an adequate QP is selected. As the QP grows,
more residual data are quantized away (video details are lost) and larger quality
losses are introduced. The actually generated bitrate and the resulting video quality
may be used as feedback for the RC unit in order to update the prediction and the
QP definition. The QP adaptation may be performed at distinct actuation levels.
In general, the RC for MVC can be classified into at least three actuation levels:
(1) GOP level (group of pictures, a set of frames), (2) frame level, and (3) basic unit
level (BU, a set of one or more macroblocks), as shown in Fig. 2.15. It is possible to
combine GOP level and frame level and, based on this observation, for simplicity,
they are jointly discussed in this monograph. A minimal sketch of such a
QP-adaptation feedback loop is shown below.
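The following is a hypothetical illustration of such a feedback loop at the frame level; the thresholds and step limits are illustrative choices, not part of any standard:

```c
/* Minimal frame-level rate control sketch: adjust the QP from the deviation
 * between produced and target bits. The +/-2 step limit that smooths the
 * visual quality variation is an illustrative choice. */
typedef struct {
    double target_bps;     /* target bitrate                 */
    double fps;            /* frame rate                     */
    int    qp;             /* current quantization parameter */
} RateCtrl;

int rc_next_qp(RateCtrl *rc, long bits_last_frame)
{
    double budget = rc->target_bps / rc->fps;   /* per-frame bit budget */
    double error  = (bits_last_frame - budget) / budget;

    int step = 0;
    if (error >  0.15) step = (error >  0.50) ?  2 :  1;  /* overshooting: coarser QP */
    if (error < -0.15) step = (error < -0.50) ? -2 : -1;  /* undershooting: finer QP  */

    rc->qp += step;
    if (rc->qp < 0)  rc->qp = 0;
    if (rc->qp > 51) rc->qp = 51;               /* H.264/MVC QP range */
    return rc->qp;
}
```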
In the following sections we present the state of the art related to 3D-video sys-
tems, MVC encoders, and multimedia processing. Also, a literature overview on the
latest low-complexity and energy-efficient solutions focusing on mode decision,
ME/DE and rate control for the MVC standard is presented. An overview on
low-power techniques is also provided to give the technical background required for
our energy-efficient architectural solutions.
The advances in video coding techniques targeting multiview videos have been
driven by the increasing set of commercial systems employing 3D-video capabilities.
These systems range from high-end cinemas and 3DTVs to mobile devices,
including content suppliers. Wider adoption is expected in the upcoming years
with the increase in available video content through 3D-capable television
broadcasters, optical media (Blu-ray Disc Association 2010), the popularization of
personal 3D camcorders, 3D-video streaming services (YouTube 3D 2011; Vimeo
2012), etc. All these commercial systems are currently based on stereoscopic videos
(two views). An increase in the number of views is expected in the near future
(Fujii 2010) to improve the observer freedom and provide a more immersive
experience. Some experimental and academic multiview systems are already
available or under development to support the next generations of 3D-video systems.
In this section we start by presenting the most prominent commercial 3D-video
systems.
3D-cinema systems build on three market-leading technologies, all based on
stereoscopic videos (IMAX 2012; RealD 2012; Dolby 2012). The technology
employed by IMAX (2012) requires linearly polarized glasses to block the light
for one eye at a time, allowing each eye to see only the frames intended for it.
Two projectors are used to display 48 fps, where each eye effectively sees 24 fps in
a time-sharing strategy. RealD (2012) is also based on time-sharing between the
two eyes; however, the glasses are circularly polarized, with each lens polarized in
the opposite direction. RealD (2012) also requires a single projector able to display
144 fps. Each effective frame for each eye is displayed thrice, resulting in an
effective 24 fps per eye. Finally, Dolby (2012) employs passive glasses with
dichroic filters, where each view is displayed with a distinct chromatic filter and
perceived by a single eye. With this strategy both views are simultaneously
displayed, allowing the use of regular 24 fps projectors. From the video coding
perspective, all these high-end applications support the stereoscopic MVC
(The Digital Entertainment Group 2009).
Stimulated by the content available on 3D Blu-ray (Blu-ray Disc Association
2010)—optical media that supports the MVC coding standard—3D televisions
already exceeded 10 % of the televisions sold in the United States in 2011, and this
number is expected to reach 37 % of the market share in 2014 (Research and
Markets 2010). Other countries are expected to follow this trend. The majority of
those 3DTVs are based on stereoscopic displays and require active shutter or
passive polarized glasses to provide the 3D sensation. Many devices employ built-in
decoders supporting the MVC standard. Like the cinema solutions, 3D televisions
are not energy-critical and typically implement only the video decoder side, which
is less complex than the encoder.
30 2 Background and Related Works
Currently, portable devices capable of handling digital video are available
everywhere at a reasonable cost. The omnipresence of these gadgets implies a very
large amount of data being produced. In this scenario the coding efficiency is a key
issue in order to reduce the storage and transmission costs of digital video. Various
devices are also capable of real-time 3D-video recording, such as those from
Panasonic (2011), Fujifilm (2011), Sharp (2011), and Sony (2011). Most of them
feature two cameras and encode the video sequences independently (simulcast).
However, an increase in the number of views from 2 up to 4–8 (Fujii 2010), in order
to provide enhanced 3D experience freedom, is envisaged for the next 3–5 years.
In this scenario it is easy to conclude that the large amount of generated data
requires the use of the state-of-the-art MVC standard. The first personal camcorder
to fully support stereo MVC was released by Sony in 2011.
Although 3D-capable mobile devices are already available, meeting MVC
performance and energy constraints remains a major challenge for industry and
academia, as discussed in Sect. 3.2. Current multimedia processing systems based
on processors, DSPs, and non-MVC-optimized application-specific integrated
circuit (ASIC) implementations cannot provide the required throughput with the
required energy efficiency while sustaining video quality and coding efficiency.
In the following section we present an overview of the main multimedia architectural
approaches and solutions in the current state of the art.
In Otero et al. (2010) an architectural template for run-time scalable systolic
coprocessors is presented. It focuses on run-time adaptation to time-variable tasks
or changing system conditions. It exploits the replacement and relocation of basic
processing elements of the array using FPGA dynamic reconfiguration. Beck et al.
(2008) employ a coarse-grained reconfigurable array with a run-time mechanism
designed to translate MIPS instructions for execution on the reconfigurable array.
Berekovic et al. (2008) present the mapping of MPEG-2 and H.264/AVC to ADRES
(a coarse-grained reconfigurable processor), delivering throughput for real-time
CIF decoding at 50 MHz with a 4 × 4-core array. CRISP, a coarse-grained
reconfigurable stream processor (Chen and Chien 2008), implements an image
processing pipeline reaching 55 fps at HD1080p resolution. Significant performance
losses are expected for video coding due to its increased complexity compared to
the implemented image processing algorithms.
In Bauer et al. (2007), the rotating instruction set processing platform (RISPP) is
presented, bringing more flexibility to extensible processors. It features a
special-instruction forecasting algorithm able to predict the hotspots and allows
adapting, at
run time, the different Molecules (implementations of the special instructions).
This architecture was evaluated using several H.264 processing hotspots (SATD,
DCT, etc.) and demonstrated high flexibility in dealing with the performance vs.
hardware trade-off. This concept was extended in Bauer et al. (2008a) by integrating
a special-instruction run-time scheduler able to outperform the state of the art by
2.38× for the H.264 application. When integrated into a transmutable embedded
processor (Bauer et al. 2008b), the RISPP concept achieved up to 7.19× speedup
over related works for H.264.
Compared to regular processors, reconfigurable processors aim to increase the
overall performance by adapting, at run time, to distinct application properties.
The adaptivity can also be efficiently exploited within a single application.
Considering multimedia applications, the performance/energy requirements may
vary with the video content, user settings, battery level, etc. This brings a large
optimization potential from the system perspective. However, when considering a
single application with a fixed description, in this case real-time MVC HD1080p
encoding, the benefit of this adaptive behavior is not realized. Moreover, in this
scenario, the energy and time costs of reconfiguration pose additional difficulties in
terms of throughput and energy efficiency compared to processors, DSPs, and ASIPs.
2.6 Energy-Efficient Architectures for Multimedia Processing
In this section we introduce the state of the art in energy management along with
an overview of video memories, energy-efficient techniques, and architectures for
multimedia processing. Additionally, the infrastructure to support dynamic voltage
scaling (DVS) on SRAM memories and dynamic power management (DPM)
schemes are presented. These techniques are extensively used in the literature and
in the solutions proposed throughout this monograph.
On-chip memories are becoming a dominant part of current systems, mainly in
signal processing systems. In the scope of video coding, the video memory is the
main on-chip memory component, responsible for storing the frames used as
reference to encode other frames. The current literature contains solutions specific
to video/frame memories as well as generic solutions for any video/image
processing task. Some of these solutions are described in the following.
The work in Grun et al. (1998) proposes a memory size estimation method for
applications containing multidimensional arrays, such as video processing. The
memory estimate is generated from the application's algorithm specification. The
paper also discusses the relation between parallelism and memory size. Zhu et al.
(2006) present a memory size computation method for multimedia algorithms.
The solution uses algebraic techniques and the theory of integral polyhedra to
compute the exact memory size for multimedia algorithms. The authors in Yamaoka
et al. (2005) use a triple-mode SRAM to implement an on-chip memory for mobile
phone applications. The on-chip memory is composed of four SRAM banks that
can be managed by a leakage state controller.
In terms of specific video memories, the authors of Shim and Kyung (2009) propose
a video memory composed of multiple on-chip memories employing data reuse to
reduce external memory accesses. A memory switching method is defined to increase
the utilization of the on-chip memory. Tsai et al. (2007) present a low-power cache
for the reference frame memory of H.264/AVC. This work uses a block translation
cache architecture and a search trajectory prediction prefetching scheme. The authors
claim a 35 % memory writing power reduction along with a 67 % memory static
power reduction.
The static energy due to leakage currents represents a significant share of the total
energy consumption in deep-submicron technologies. Moreover, current integrated
circuit footprints are dominated by embedded memories, which are typically
implemented as SRAM (static random access memory). Therefore, reducing SRAM
static consumption is a key challenge for overall energy reduction.
The evolution of fabrication technology has contributed meaningfully to leakage
reduction by employing high-K oxides (Huff and Gilmer 2004), FinFET transistors
(Pei et al. 2002), etc. Throughout this monograph we assume the use of an on-chip
SRAM memory featuring multiple power states with data retention capabilities.
The high-level memory organization is presented in Fig. 2.16. An implementation
of this memory organization, including a picture of the silicon die, is demonstrated
in Zhang et al. (2005). Still, there is a need to further reduce the leakage at the
architecture and system levels through techniques such as power gating, DVS, and
DPM. In Sects. 2.6.3 and 2.6.4, we present an overview of power/energy management
techniques for memories and multimedia systems, respectively.
Fig. 2.16 SRAM memory organization: banks (Bank 0 … Bank m) divided into sectors (Sector 0 … Sector n), each sector connected to the power supply through a sleep transistor driven by the power management unit
In Fukano et al. (2008) DVS with a dual power supply is used to implement a
65 nm SRAM memory featuring three operation modes: (1) high speed, (2) low
power, and (3) sleep. In this work the low power and sleep modes are data-retentive,
avoiding data refetching, but partial DVS for specific sectors of the SRAM is not
supported. Yamaoka et al. (2004) present a similar solution employing three
operation modes while supporting bank-level DVS. It achieves leakage reduction
by adapting the virtual supply voltage using PMOS transistors. Finally, the 65 nm
SRAM design presented in Zhang et al. (2005) provides more flexibility through
multiple power states and fine-grained power control. The DVS is controlled at
sector level using a custom NMOS sleep transistor to control the virtual ground
voltage.
Energy and power management for multimedia systems has been studied in many
research works, mostly targeting embedded applications. The authors in Cao et al.
(2010) employ DVS with five distinct voltage levels, controlled using
application-specific knowledge through workload modeling for a wavelet video
encoder. In Kamat (2009) a battery level-aware MPEG-4 video encoder with a
notification manager and an application-specific controller is presented. Some
solutions exploit the energy vs. video quality trade-off at run time to adapt to the
system scenario. Ji et al. (2010) partition the input data into distinct profiles used
for energy budgeting, generating scalable video quality according to the energy
scenario. Similar work is presented in Ji et al. (2009), applying game-theoretic
algorithms to control the video encoder. Liang and Ahmad (2009) propose a
rate–complexity–distortion model to progressively adjust the H.263+ encoder
behavior considering the video content. It employs DVS and reaches up to 75 %
energy reduction. A power–rate–distortion model (He et al. 2008) is used for energy
minimization in video communication devices by exploiting the energy trade-off
between video encoding and wireless communication, providing up to 50 % energy
reduction. A dynamically quality-adjustable H.264 encoder is proposed in Chang
et al. (2009). It defines quality states to change the number of coding modes
considering the power vs. quality trade-off. The implemented ASIC provides
real-time 720p encoding at 183 mW. The proposals summarized in this section are
useful in the MVC scenario but lack MVC-specific knowledge such as workload
models, quality states, rate–distortion behavior, etc. Thus, the direct application of
these approaches leads to inefficient energy management.
The authors in Javed et al. (2011) present an adaptive pipelined MPSoC for H.264/
AVC with a run-time system that exploits the knowledge of macroblock
characterization based on spatial and temporal properties (Shafique et al. 2010;
Shafique et al. 2010a) to predict the workload. Based on this knowledge, unused
processors are clock-gated or power-gated. These techniques provide limited energy
efficiency in MVC as they cannot exploit MVC-specific knowledge such as (a) the
distribution of memory usage at frame and MB levels and (b) the memory usage
correlation in the 3D-neighborhood.
The work of Shafique et al. (2010) presents an energy budgeting scheme for the
H.264 ME. This solution considers the total available energy along with the video
properties to dynamically define a search pattern able to deal with the energy vs.
quality trade-off. Each frame is classified into one of six energy classes, and further
classification refinement is performed at the MB level. The highest complexity
class performs a search composed of three search patterns (Octagon Star, Polygon,
and Diamond) without sample subsampling. The lowest complexity class employs
a Diamond-shaped search using 4:1 subsampling. The highest complexity class
requires 17× more energy than the lowest complexity class.
The authors in Chen et al. (2006) evaluated different state-of-the-art data reuse
schemes (Level-A, Level-B, Level-C, and Level-D) and proposed a new search
window-level data reuse scheme for H.264 ME (Level-C+) in order to reduce the
energy consumption related to external memory accesses and on-chip memory
storage. Level-A and Level-B solutions are based on candidate blocks: while
Level-A fetches and stores on-chip a single candidate block, Level-B fetches a
whole candidate stripe (inside the search window). Both require frequent external
memory accesses and only fit regular search patterns, which is not the case for
state-of-the-art ME/DE algorithms. Level-C and Level-D follow the same logic but
at the search window level. Level-C stores one search window (avoiding the
retransmission of overlapping search window regions accessed by neighboring
MBs in the same line) and Level-D stores a search window stripe for the whole
frame. Observe that Level-D requires an extremely large on-chip memory for large
search windows or frame sizes. As Level-C presents a reasonable trade-off between
external memory access and on-chip memory size, it was extended into Level-C+.
Level-C only exploits the data reuse between horizontally neighboring MBs.
Level-C+ increases the vertical on-chip storage to include the search window of the
MB line below. This allows exploiting the vertical data reuse at the cost of increased
on-chip memory and out-of-order processing (two MB lines are processed using a
double-Z order). The memory sizes involved are illustrated by the sketch below.
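To give a feeling for the sizes involved, a back-of-envelope sketch assuming a square [±96,±96] search range, 16 × 16 MBs, one byte per luma sample, and a 1,920-sample-wide frame:

```c
/* Back-of-envelope on-chip memory sizes (bytes) for search window data
 * reuse schemes, under the simplification of a square search range
 * [-SR,+SR] and 16x16 MBs. Illustrative only. */
#include <stdio.h>

int main(void)
{
    const long SR = 96, MB = 16, FRAME_W = 1920;

    long level_c  = (2*SR + MB) * (2*SR + MB);        /* one search window          */
    long level_cp = (2*SR + MB) * (2*SR + 2*MB);      /* + vertical extension (C+)  */
    long level_d  = (2*SR + MB) * FRAME_W;            /* SW stripe across the frame */

    printf("Level-C : %ld KB\n", level_c  / 1024);    /* ~42 KB  */
    printf("Level-C+: %ld KB\n", level_cp / 1024);    /* ~45 KB  */
    printf("Level-D : %ld KB\n", level_d  / 1024);    /* ~390 KB */
    return 0;
}
```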
In Wang et al. (2009) a bandwidth-efficient H.264 ME architecture using binary
search is proposed. This solution employs a frame-level preprocessing step that
downsamples the image twice by a factor of 2, resulting in three images (or three
layers): the original image, the downsampled image, and the twice-downsampled
image. After that, a search is performed across the three layers. The technique is
also modified to allow parallel processing and easy hardware implementation.
A hardware architecture is presented targeting low power through reduced memory
accesses, efficient hardware utilization, and low operating frequency.
A complete MVC encoder targeting low-power operation is presented in Ding
et al. (2010), employing eight pipeline stages, dual CABAC, and parallel MB
interleaving. A cache-based solution is used for reading the search window, along
with a specific prefetching technique. The cache tags are formed by the frame index
and the x and y block positions. Also, each cache entry stores an image block
(instead of words, as in generic caches), following the same concept proposed in
Zatt et al. (2007). The search is constrained to a [±16,±16] search window with a
predicted center point. The ME/DE architecture is described in more detail in
previous work
from the same group (Tsung et al. 2009). This approach may lead to quality loss
when the center point prediction is not accurate. Moreover, the authors ignore the
fact that fast ME/DE schemes already consider this information to start the search.
The MVC encoder is able to encode four HD720p views in real time at 317 mW.
Generally, search window-based data reuse approaches suffer from excessive
leakage resulting from the large on-chip SRAM memories required to store the
complete rectangular search windows. This point becomes crucial for MVC, as the
DE requires relatively large search windows (mainly for high resolutions), such as
[±96,±96], to accurately predict high-disparity regions (Xu and He 2008). In this
case, even asymmetric search windows incur a large on-chip storage overhead and
thus suffer from significant leakage.
The authors in Shim and Kyung (2009) use multiple on-chip memories to realize
either one large logical memory or multiple memories (one for each reference
frame), according to the frame motion. Search window center prediction is
employed for data prefetching, while the search window size is dynamically
adjusted at the frame level using the sizes of the motion vectors found in previous
frames. The Level-C data reuse scheme is employed.
A data-adaptive structured search window scheme is presented in Saponara and
Fanucci (2004). An adaptive window size approach is proposed considering the
spatial/temporal correlation of the motion field. If the vectors of the current and
past frames do not exceed a given value, there is no need to search in a region larger
than this vector size, and fetching a reduced window suffices. In case the window is
too small and the error starts to increase, a test detects this and the search window
is increased regardless of the neighborhood. This solution reduces external memory
accesses, but its potential for on-chip memory reduction is not discussed.
The work in Chen et al. (2007) proposes a candidate-level data reuse scheme and
a Four-Stage Search algorithm for ME. First, multiple search start points are
predicted from the motion activity of the neighboring MBs. The predicted points
are evaluated and the best one is selected for a Full Search around its position.
A ladder-shaped data arrangement is also proposed in order to support random
access for the proposed algorithm. The parallel processing of candidates is
performed using a systolic array.
In Tsai et al. (2007) a caching algorithm is proposed for fast ME. Additionally, a
prefetching algorithm based on search path prediction is proposed in order to reduce
the number of cache misses. The work (Tsai et al. 2007), however, is limited to a
fixed Four Step Search pattern and considers neither disparity estimation nor power
gating.
Fig. 2.17 Main power dissipation sources in a CMOS inverter: (a) NMOS leakage current ILeak,n for VI < VTN, (b) PMOS leakage current ILeak,p, (c) switching power charging the load capacitance CL, (d) short circuit current ISC
Figure 2.17 represents the three main power dissipation sources of CMOS circuits,
using an inverter as example: leakage current (static), switching power (dynamic),
and short circuit current (dynamic). Equation (2.5) expresses the total power in
terms of these three components. The static power dissipation is a result of the
leakage currents. Consider Fig. 2.17a, where the input voltage (VI) is lower than the
NMOS transistor threshold voltage (VTN). In this case, an ideal inverter NMOS
transistor would not conduct any current. However, real MOSFET transistors cannot
completely block this current, the so-called leakage current. The closer VI is to VTN,
the stronger the leakage. The same happens to PMOS transistors when VI > VTP is
applied to the gate (Fig. 2.17b). The leakage power for the case represented in
Fig. 2.17b is calculated by Eq. (2.6).
The dynamic power is composed of two components: the switching power
(Fig. 2.17c) and the short circuit power (Fig. 2.17d). Equation (2.7) defines the
switching power, which depends linearly on the load capacitance (CL, determined
by the fanout of the device), quadratically on the supply voltage VDD, and linearly
on the operating frequency (f) and on the switching activity of the device (α).
It represents the energy that is charged into the load capacitance and later drained
to ground. Note that the energy is actually drained only after two switches: in the
first time instant (shown in Fig. 2.17c) the capacitance CL is charged, and in the
second time instant (after another switch) the energy is drained from CL to ground.
This justifies the ½ factor in Eq. (2.7). The short circuit current occurs while the
input signal transitions between VDD and GND. There is a range of input voltages
for which both PMOS and NMOS transistors conduct, and a current is drained
directly from VDD to ground. It is depicted by the current in Fig. 2.17d, and the short
circuit power is defined by Eq. (2.8). The total energy drained is the total power
integrated over time (t), as represented in Eq. (2.9). Other power dissipation sources
(such as gate leakage) exist in CMOS devices, but they are omitted from this short
overview for simplicity.
As can be seen from this overview, both static and dynamic power can be reduced.
For instance, reducing the computation reduces the dynamic power, since α is
reduced. If frequency scaling is used, f is also reduced. Moreover, if voltage scaling
is used, the dynamic energy is reduced quadratically, because VDD is reduced.
For leakage reduction, circuits featuring multiple threshold voltages are used.
Hardware support alone is not sufficient, however: application knowledge and
energy-aware control algorithms are needed to accurately control thresholds,
frequency, and voltage.
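In their standard form, consistent with the description above:

$$P_{total} = P_{static} + P_{switching} + P_{short} \qquad (2.5)$$

$$P_{static} = I_{leak} \cdot V_{DD} \qquad (2.6)$$

$$P_{switching} = \tfrac{1}{2}\,\alpha\,C_L\,V_{DD}^{2}\,f \qquad (2.7)$$

$$P_{short} = I_{SC} \cdot V_{DD} \qquad (2.8)$$

$$E = \int_{0}^{t} P_{total}\,dt \qquad (2.9)$$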
The mode decision is one of the main contributors to the high complexity and
consequent energy consumption of MVC. The optimal solution, the exhaustive
RDO-MD, requires the evaluation of all possible inter-prediction and intra-prediction
modes defined by the standard. Such a solution is not feasible for real-world
implementations. Thus, there is a need to reduce the number of modes evaluated
during the coding process. Statically defining the modes to be tested does not
perform well due to changing coding parameters and video input characteristics.
For this reason, it is necessary to dynamically define the most probable coding
modes using the data available at run time. Figure 2.18 shows a hypothetical fast
MD scheme which selects a few candidate modes out of all possible modes. Current
solutions, as detailed along this section, use information extracted from the video
content (texture, edges, brightness), coding mode history, video geometry, etc.
Several fast MD schemes have been proposed to reduce single-view H.264
complexity, such as fast I-MB MD, fast SKIP MD, fast P-MB MD, and combinations
thereof. These fast mode decisions of H.264 may be deployed for MVC. However,
they perform inefficiently for the non-anchor frames, as they exploit neither the
inter-view correlation nor the knowledge of the global disparity vector (GDV).
Recently, multiple fast MD schemes have been proposed for MVC (Peng et al.
2008a; Lee et al. 2008; Han and Lee 2008; Shen et al. 2009a, b, 2010a; Ding et al.
2008a; Zeng et al. 2011; Chan and Tang 2012) considering the GDV, camera
geometrical properties, inter-view correlation, and early SKIP prediction.
Fig. 2.18 Hypothetical fast mode decision scheme: out of all inter-frame/view prediction modes (SKIP and partitions down to 8 × 8, 8 × 4, 4 × 8, 4 × 4) and intra-prediction modes (DC, Vertical, Horizontal, Plane), only a few candidate modes are selected for evaluation
The authors in Lee et al. (2008) propose an object-based mode decision that uses
image segmentation to evaluate different prediction modes for foreground and
background regions. The image is segmented using a motion-based approach,
considering the vector sizes and the SAD with respect to the collocated block (the
block in the same relative position). If the motion vector differs significantly from
the vector average (with respect to a threshold) and the SAD exceeds a given value,
the region is considered a foreground object; otherwise it is background. Region
growing is used to merge the foreground objects. The foreground regions are coded
using DE, while the background regions are coded using ME. The boundary MBs
are coded using the exhaustive RDO-MD.
A fast mode decision based on the GDV is presented in Han and Lee (2008). In this
scheme the base view—encoded using exhaustive RDO-MD—is used to segment
the other views into foreground and background regions. The coding modes of the
base view are used to classify the image regions: SKIP and Inter 16 × 16 MBs are
defined as background, while the remaining modes are considered foreground
objects. As the objects present displacement between views, the GDV is used to
displace the classified regions accordingly. Finally, the foreground regions are
encoded using exhaustive RDO-MD and the background regions are encoded using
large block sizes.
The fast mode decision scheme of Shen et al. (2009a, b) considers the information
of the reference view to classify the current MB into one of three complexity
classes. For that, the authors propose a mode complexity metric (MDC) defined as
the sum of the mode complexities in a 3 × 3 MB window. SKIP and Inter 16 × 16
have complexity "0", Inter 16 × 8 and 8 × 16 have complexity "1", and Inter 8 × 8
(or smaller) and Intra have complexity values "2" and "3", respectively. If the MDC
is smaller than a given threshold (T0), the region is classified as simple. Conversely,
if the MDC exceeds another threshold (T1 > T0), it is classified as complex. Regions
presenting an MDC between these thresholds are defined as medium complexity.
Simple regions test only the Inter 16 × 16 mode; medium regions evaluate the Inter
16 × 16, 16 × 8, and 8 × 16 modes; complex MBs are encoded using the exhaustive
RDO-MD. A sketch of this classification is given below.
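A sketch of this classification, using the complexity weights described above and leaving the thresholds as parameters (their published values are not reproduced here):

```c
/* Mode classes used by the complexity metric; MODE_SUB8x8 stands for
 * all partitions of 8x8 or below. */
enum Mode  { MODE_SKIP, MODE_16x16, MODE_16x8, MODE_8x16, MODE_SUB8x8, MODE_INTRA };
enum Class { SIMPLE, MEDIUM, COMPLEX };

static int mode_complexity(enum Mode m)
{
    switch (m) {
    case MODE_SKIP: case MODE_16x16: return 0;
    case MODE_16x8: case MODE_8x16:  return 1;
    case MODE_SUB8x8:                return 2;
    default:                         return 3;   /* intra */
    }
}

/* Sum the complexities over the 3x3 collocated MB window in the
 * reference view and map the MDC value to a complexity class. */
enum Class classify_region(const enum Mode win3x3[9], int t0, int t1)
{
    int mdc = 0;
    for (int i = 0; i < 9; i++) mdc += mode_complexity(win3x3[i]);
    if (mdc < t0) return SIMPLE;    /* evaluate Inter 16x16 only      */
    if (mdc > t1) return COMPLEX;   /* fall back to exhaustive RDO-MD */
    return MEDIUM;                  /* evaluate 16x16, 16x8, and 8x16 */
}
```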
In Zeng et al. (2011) a fast mode decision approach is proposed that classifies the
current MB according to its motion activity, based on the coding
modes of the base view. First, five motion-activity classes are defined in relation to
the coding modes: SKIP belongs to the motionless class (1); slow motion (2) is
defined by SKIP and ME 16 × 16; ME 16 × 8 and 8 × 16 are considered moderate
motion (3); fast motion regions (4) are defined by ME 8 × 8 or smaller; finally, DE
and Intra define high texture with fast motion or scene cuts (5). A mode
correlation-based mode decision (MCMD) metric is defined and calculated over the
3 × 3 collocated MB window, where each neighboring MB has an offline-defined
weight. This metric is used to classify the motion activity of the current MB into
one of the classes described above. Independently of the motion activity, the SKIP
mode is evaluated first and an early termination test is employed. If the SKIP
prediction is not effective, other modes are evaluated according to the motion class,
following the same classification described above. For instance, a slow motion MB
evaluates only ME 16 × 16.
The work proposed in Chan and Tang (2012) exploits the statistical behavior of
the RDCost for the different coding modes, along with the motion vector differences,
in order to speed up the MVC encoding. In this solution, an iterative mode decision
is employed. Based on statistical knowledge showing that ME is used more
frequently than DE, the first iteration evaluates only the ME modes (all sizes).
If the ME-based prediction is not satisfactory, a second iteration evaluates the DE
modes. However, only the block sizes that presented the best coding performance
for ME are evaluated for DE in the second iteration.
State-of-the-art schemes mainly achieve complexity reduction via fast MD.
However, they do not exploit the full space of neighborhood correlation across the
spatial, temporal, and view domains. These schemes deploy fixed thresholds (Han
and Lee 2008; Shen et al. 2009a, b) and, consequently, are unable to react to
changing QPs (i.e., changing bitrates). Moreover, in the worst case, state-of-the-art
schemes—like (Han and Lee 2008; Shen et al. 2009a, b)—check all prediction
modes, thus falling back to the exhaustive RDO-MD. As a result, these schemes
provide limited complexity reduction.
In general, state-of-the-art schemes consider the reference view to be encoded
using the exhaustive RDO-MD and employ their fast MD scheme on the other
views. These schemes prioritize the frames of the base view, so the quality of the
other views relies on the view encoded with exhaustive RDO-MD. This may lead
to a meaningful increase of the prediction error in the last views.
To find a single optimal or good matching block, the ME/DE performs several
block-matching operations in multiple reference frames. Additionally, this search is
replicated for the multiple block sizes defined by the MVC standard. However,
there are search directions (ME or DE), reference frames, and reference regions
that are highly unlikely to provide a good match. Also, there are suboptimal points
that provide similar results at a much reduced search effort. See the example in
Fig. 2.19a: a good match for the diamond object is available in just one of the
four reference frames: the past temporal reference. In the future temporal reference
the diamond is partially occluded by a second object (a rectangle). In the disparity
references the diamond is either occluded or outside the captured scene. In this
scenario there is no need to perform searches in all reference frames, resulting in
complexity/energy reduction. Another example is depicted in Fig. 2.19b. Note that
the previously encoded neighboring MBs share a similar motion/disparity vector,
since they belong to the same object. Therefore, the current MB, which also belongs
to the same object, is very likely to share a similar vector. This knowledge may be
used to reduce the number of search operations by reducing the number of candidate
blocks. A wide range of techniques to reduce the ME/DE complexity is available,
as presented in the following.
State-of-the-art fast ME/DE algorithms employ a variable search range based on
disparity maps (Xu and He 2008), taking into account the distinct behaviors of ME
and DE. The work presents a study on how the search window size impacts the
coding efficiency, showing the importance of large search windows. The disparity
maps, however, show that it is possible to reduce the effective search window by
monitoring them. From the disparity maps two parameters, named vertical and
horizontal scales (VS, HS), are defined. From these parameters the search window
is reduced or increased in an asymmetric way, i.e., the search window may assume
rectangular shapes. The increase and reduction are done by a factor of 2.
In Kim et al. (2007) two strategies are used to predict motion and disparity vectors.
One vector predictor uses the traditional spatial median predictor from the upper,
left, and upper-right neighboring MBs. The other predictor uses the camera
geometry and vectors from previously encoded frames to estimate the current
vectors. The difference between the two predicted vectors is used to calculate the
search window size: a small difference means accurate predictors, so a small search
window is required; for large differences a larger window is needed. A sketch of
this idea is given below.
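A sketch of the idea; the base range, the scaling of the predictor difference, and the cap are illustrative choices rather than the authors' published values:

```c
/* Predictor-difference-driven search range sizing (Kim et al. 2007
 * flavor): the gap between the spatial median predictor and the camera
 * geometry-based predictor scales the search window. */
int search_range(int med_x, int med_y, int geo_x, int geo_y)
{
    int dx = med_x > geo_x ? med_x - geo_x : geo_x - med_x;
    int dy = med_y > geo_y ? med_y - geo_y : geo_y - med_y;
    int d  = dx > dy ? dx : dy;    /* disagreement between predictors  */
    int sr = 8 + 2 * d;            /* small gap -> small window        */
    return sr > 96 ? 96 : sr;      /* cap at a DE-friendly range       */
}
```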
A fast direction prediction (ME or DE) based on the motion intensity of blocks is
proposed in Lin and Tang Angela (2009). It exploits the inter-view correlation to
predict a search direction, reducing the ME/DE complexity. The base view is
encoded using ME, and the frame regions are classified as slow motion if the SAD
is smaller than a threshold. Similarly, the anchor frames of all views are
Techniques to reduce the complexity and energy consumption of the video encoder
(such as fast mode decision and fast motion/disparity estimation) typically lead to
video quality losses. To control these quality losses, rate control methods may be
employed through QP adaptation. Several rate control schemes are found in the
current literature; mostly they are developed targeting single-view encoders such as
H.264. Recently, a few works specific to the MVC standard have been proposed,
focusing on frame- and BU-level RC. In this section we present an overview of the
state of the art on rate control.
In the single-view domain, the majority of recent proposals are extensions of the
RC implemented in the H.264 reference software, which employs a quadratic model
for mean absolute difference (MAD) distortion prediction (Li et al. 2003). However,
the quadratic model leads to limited control performance, as discussed in Tian et al.
(2010). Aware of this limitation, the authors in Jiang et al. (2004) and Merritt and
Vanam (2007) propose improved MAD prediction techniques. The scheme
presented in Kwon et al. (2007) implements both distortion and rate prediction
models, while in Ma et al. (2005) the RC exploits rate–distortion optimization
models. An RC based on a PID (proportional–integral–derivative) feedback
controller is presented in Zhou et al. (2011). In Wu and Su (2009), an RC scheme
for encoding H.264 traffic surveillance videos is proposed, using regions of interest
(RoI) to highlight regions that contain significant information. In Agrafiotis et al.
(2006), RoI is used to highlight preset regions of interest using priority levels.
However, single-view approaches do not fully consider the correlation available in
the spatial, temporal, and view domains and, consequently, cannot efficiently
predict the bit allocation or distortion, resulting in inefficient RC performance.
A sketch of the quadratic model is given below.
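As an illustration, a sketch of this quadratic model under simplified assumptions (the regression update of the coefficients a1, a2, c1, c2 over past frames is omitted; function names are ours):

```c
#include <math.h>

/* Linear MAD prediction from the previous frame, in the style of
 * Li et al. (2003): mad_pred = a1 * mad_prev + a2. */
double predict_mad(double a1, double a2, double mad_prev)
{
    return a1 * mad_prev + a2;
}

/* Quadratic rate model: T = c1*MAD/Q + c2*MAD/Q^2, solved for the
 * quantization step Q given the per-frame bit target T. */
double qstep_from_target(double c1, double c2, double mad, double T)
{
    /* T*Q^2 - c1*MAD*Q - c2*MAD = 0  ->  take the positive root */
    double a = T, b = -c1 * mad, c = -c2 * mad;
    return (-b + sqrt(b * b - 4 * a * c)) / (2 * a);
}
```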
The early RC proposals targeting the MVC encoder are simple extensions of
single-view approaches (Li et al. 2003) and are still unable to fully exploit multiview
properties. Novel solutions have been proposed, but most of them are limited to
frame-level actuation. The solution in Yan et al. (2009b) uses an improved MAD
prediction that differentiates the frame types: intra frames, P and B frames with
only temporal prediction, P and B frames with only disparity prediction, and B
frames with both temporal and disparity prediction feature distinct MAD prediction
equations. Once the MAD is predicted, the target bitrate is predicted for the GGOP,
refined for the GOP, and finally defined for each frame. An appropriate QP for each
frame is then derived from the target bitrate. This work is extended in Yan et al.
(2009a) by a technique that defines the first QP in the GGOP, used to encode the
I frame. These solutions, however, are unable to properly handle the complex HBP
structure of MVC, limiting the number of input samples and the rate control
learning.
The authors of Xu et al. (2011) define a pyramid-based priority structure
extracted from the MVC HBP. The higher pyramid levels are used as references
to encode the lower pyramid levels; e.g., I and P frames belong to the highest level,
B frames that refer to I and P frames belong to the second highest level, and so on.
The higher levels are prioritized and are encoded using lower QPs (higher quality) in
order to reduce error propagation. This solution, however, considers a fixed HBP
structure and does not exploit the inter-GOP correlation.
To deal with distinct image regions within a frame, a BU-level RC is needed.
Moreover, in order to find a globally optimal solution, a joint frame- and BU-level
rate control scheme must be designed. Recent works have proposed solutions for
BU-level RC in MVC. In Park and Sim (2009) a solution is presented that deals
with frame-level and macroblock (or BU)-level rate control. First, the rate for each
view is calculated based on weight parameters defined by the user. After that, the
QP for each GOP is defined using the traditional H.264-based approach (Li et al.
2003), followed by a QP refinement for each frame. The frame-level QP definition
considers the HBP coding structure to prioritize frames at higher hierarchical levels.
A MAD-based strategy is used to calculate the target bitrate at the MB level, and a
rate–distortion model (not described in the paper) is employed to define the QP for
each MB.
The authors in Lee and Lai (2011) consider the HVS properties to propose a
BU-level rate control solution that prioritizes the regions that are visually more
important to the observer. For that, they define regions of interest using the
just-noticeable difference metric (Liu et al. 2010) along with luminance difference
and edge information. Depending on the relation between these metrics, the QP is
increased or decreased with respect to the initial QP (the maximum QP in the
neighborhood). However, this solution does not employ feedback-based control
and considers the coding information of only one reference frame.
Generally, the available rate control techniques cannot fully exploit the correlation
potential available in the spatial, temporal, and view domains of MVC. In addition,
they are unable to adapt to multiple HBP structures and cannot employ the
inter-GOP periodic behavior for RC optimization. Moreover, to the best of our
knowledge, no work has proposed a rate control scheme for MVC able to jointly
consider the frame and BU levels in a hierarchical and integrated fashion.
This section presents the background concepts required to understand the rate
control solution proposed in this monograph. First, we present the control theory
basics behind the model predictive control (MPC) used for the frame-level RC.
Then, we present the statistical foundation supporting the Markov decision process
(MDP) implemented in our BU-level RC. Finally, the concepts related to
reinforcement learning (RL) are introduced.
Fig. 2.20 Conceptual MPC behavior: measured outputs and past control inputs, predicted outputs and predicted control inputs along the prediction horizon, and the reference trajectory
The dynamics of the controlled system may vary significantly, and the selected
control method must ensure the stability of the given system. Thus, the selection of
a control method for a given dynamic system may be very challenging; a controller
that does not fit the system may compromise the stability of the entire system.
Among state-of-the-art control methods, MPC has gained prominence for being
able to accurately predict and actuate on dynamic multivariable systems. It
represents not a single control algorithm but a controller design scheme applicable
to distinct systems, including continuous- or discrete-time, linear or nonlinear, and
integrated or distributed systems. MPC outperforms conventional feedback
controllers (like PID) by explicitly integrating input, output, and state-space
constraints. Also, MPC can dynamically adapt to new contexts by employing
rolling input and output horizons (see details below).
The main goal of MPC is to define the optimal sequence of actions to lead the
system to a desired and safe state by considering the system feedback on previous
states and previously taken actions (see the conceptual MPC behavior in Fig. 2.20).
To define this sequence of actions, the MPC minimizes the performance function
presented in Eq. (2.10), which minimizes the cost of defining a set of outputs y
based upon a set of inputs u. Here u[k + i − 1|k], i = {1, …, m}, denotes the set of
process inputs with respect to which the optimization is performed; u is known as
the control horizon or input horizon in MPC theory. Analogously, y[k + i|k],
i = {1, …, p}, is the set of outputs, named the prediction horizon or output horizon
(see Fig. 2.20). The control horizon determines the number of actions to find; the
prediction horizon determines how far the behavior of the system is predicted.
m and p are the sizes of the control/input and prediction/output horizons,
respectively: m is the number of measured outputs (history size) used for the
optimization process, while p defines how many outputs are predicted, that is, how
many future actions are considered in the optimization process. k is the horizon
index and represents the kth input/output horizon. ySP defines the output set point
for the prediction horizon:
$$\min_{u[k|k]\,\ldots\,u[k+m-1|k]} \;\; \sum_{i=1}^{p} w_i \left( y[k+i \mid k] - y^{sp} \right)^{2} \;+\; \sum_{i=1}^{m} r_i \,\Delta u[k+i-1 \mid k]^{2} \qquad (2.10)$$
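As an illustration, a toy receding-horizon controller for a scalar integrator plant y[k+1] = y[k] + u[k], minimizing the cost of Eq. (2.10) by brute force over a small quantized input set (all constants are illustrative):

```c
#include <float.h>

#define HORIZON 3
static const double U[5] = { -2, -1, 0, 1, 2 };   /* candidate inputs */

/* Return the first action of the input sequence minimizing
 * sum w*(y - y_sp)^2 + r*du^2 over the prediction horizon. */
double mpc_first_action(double y0, double u_prev, double y_sp,
                        double w, double r)
{
    double best_u = 0, best_cost = DBL_MAX;
    for (int i0 = 0; i0 < 5; i0++)
    for (int i1 = 0; i1 < 5; i1++)
    for (int i2 = 0; i2 < 5; i2++) {
        const double u[HORIZON] = { U[i0], U[i1], U[i2] };
        double y = y0, up = u_prev, cost = 0;
        for (int k = 0; k < HORIZON; k++) {
            y    += u[k];                          /* predicted output       */
            cost += w * (y - y_sp) * (y - y_sp)    /* tracking error         */
                  + r * (u[k] - up) * (u[k] - up); /* input change penalty   */
            up = u[k];
        }
        if (cost < best_cost) { best_cost = cost; best_u = u[0]; }
    }
    return best_u;   /* only the first action is applied; then re-plan */
}
```

Only the first action of the optimal sequence is applied; the horizons then roll forward one step and the optimization is repeated, which is what gives MPC its adaptivity.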
Fig. 2.21 (a) Markov decision process with states s, actions a, rewards r, and transition probabilities p (i: state index, j: action index); (b) controlled system: at time t the control agent applies action a to the system, whose transition and reward functions return the next state s′ and the reward r′ used at time t + 1
shared reward Rat(s, s′), as shown in Fig. 2.21b. The rewards are used by the decision
maker in order to find an action that maximizes, for a given policy, the total
accumulated reward, as shown in Eq. (2.11), where 0 ≤ γ ≤ 1 denotes the discount
factor:
$$\sum_{t=0}^{\infty} \gamma^{t}\, R_{a_t}(s_t,\, s_{t+1}). \qquad (2.11)$$
By definition, a Markov process is considered a controlled Markov process if the
transition probabilities P(S) can be affected by an action. Equation (2.12) defines
the probability Pa that an action a taken in state s at time t will lead to state s′ at
time t + 1:
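$$P_a(s, s') = \Pr\left(s_{t+1} = s' \mid s_t = s,\; a_t = a\right) \qquad (2.12)$$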
Within a video frame there may exist multiple regions or objects with distinct
image properties and distinct importance for the observer. The image regions that
are considered, for some reason, more important are called Regions of Interest.
In this monograph, we consider all regions of semantically equal importance
The MVC is the most efficient video coding standard for 3D-video coding. It
provides a 20–50 % coding efficiency increase compared to H.264 simulcast by
employing inter-view prediction, the disparity estimation. Mode decision and
motion and disparity estimation are the most complex modules of the MVC encoder
and pose big challenges for real-world implementations.
The implementation of MVC encoders may exploit different multimedia
processing architectural solutions. Currently, the most prominent alternatives are
multimedia processors/DSPs, reconfigurable processors, ASICs, and heterogeneous
multicore SoCs. Each solution has positive and negative points. On the one hand,
ASICs provide the highest performance and energy efficiency at the cost of no
flexibility. On the other hand, multimedia processors/DSPs are totally flexible but
deliver low performance and reduced energy efficiency. Heterogeneous multicore
and reconfigurable processors provide trade-off points between ASICs and
processors. By employing units specialized for each kind of task, heterogeneous
multicore SoCs improve the performance with respect to multimedia processors but
typically present issues related to programming and portability. Reconfigurable
processors can cover this gap by employing an extensible instruction set and
defining, at
run time, whether regular or custom instructions should be used at that specific time
instant. Still, these solutions are unable to meet the performance and energy
efficiency required for MVC encoding without application-specific ASIC
acceleration. Therefore, considering the current technology, a complete ASIC
encoder or heterogeneous SoCs with application-specific hardware accelerators are
seen as the most feasible solutions for embedded mobile devices.
Multiple proposals targeting complexity and energy reduction for MVC are
available in the current literature. These contributions are centered on two
abstraction levels: the algorithmic and the architectural level. From the coding
algorithm perspective, complexity reduction is most frequently addressed at the
mode decision and the motion and disparity estimation, because they are the most
complex MVC blocks. The mode decision solutions use distinct side information in
order to reduce the number of coding modes tested during the coding process.
Video properties such as texture, edges, luminance, and motion/disparity activity
are used to predict the most probable coding modes in each image region.
Additionally, extensive analysis has been performed to learn how neighboring
views and frames are correlated; this correlation is also useful to predict the coding
modes. As the ME/DE spends about 90 % of the total encoding time, the same kind
of information is used to predict the most probable motion and disparity vectors
and reduce the ME/DE complexity. However, the related works do not fully exploit
the correlation available within the 3D-neighborhood and perform poorly under
content-changing scenarios. Moreover, these solutions are not developed from an
energy perspective and cannot react to battery-level changes by dynamically
adapting the complexity to the available energy.
Generally, the complexity reduction techniques lead to uncontrolled quality
degradation and coding efficiency losses. The rate control becomes a key task in
order to minimize this drawback of complexity reduction. The majority of currently
available rate control solutions target H.264 or are simple extensions of H.264
solutions. The few rate control algorithms designed for MVC focus only on
frame-level or basic unit-level actuation. Additionally, these algorithms do not use
the intra- and inter-GOP bitrate correlation in the 3D-neighborhood.
From the hardware architecture perspective, the ME/DE is the most studied MVC
coding block. The ME/DE is a processing- and memory-intensive task requiring
massively parallel processing and efficient memory access and management. The
resulting high energy consumption is mainly related to external memory accesses
and on-chip video memory size. Diverse related works propose ME/DE processing
hardware architectures, memory hierarchies, and data reuse techniques. However,
they share limitations related to the implemented complexity reduction algorithms
(leading to quality losses), excessive external memory accesses, or large on-chip
memories resulting in high energy. Moreover, most of the available architectures
lack the ability to dynamically adapt their operation to changing coding parameters
or video content characteristics.
Therefore, there is a demand for novel, energy-efficient MVC encoding solutions
able to significantly reduce the energy consumption under changing video and
system scenarios. For this reason, this monograph targets jointly addressing the
energy issues at the algorithmic and architecture levels while sustaining the video
quality.
Chapter 3
Multiview Video Coding Analysis for Energy and Quality
The Multiview Video Coding (MVC) standard brings high coding efficiency gains,
reducing the bitrate by 20–50 % for similar video quality compared to H.264/AVC
simulcast. The coding efficiency gains are driven by novel high-complexity coding
tools that drastically increase the overall encoding effort and, consequently, the
energy consumption. In this section an extensive analysis of the energy requirements
for real-time MVC encoding and the energy consumption breakdown is presented.
The goal is to provide a better comprehension of the MVC performance and energy
requirements. Additionally, the requirements in terms of objective video quality are
discussed in the following.
Encoding MVC at high definitions has proven to be an unfeasible task for mobile
devices when all coding tools are implemented without energy-oriented
optimizations. State-of-the-art embedded devices are unable to provide the
processing performance or to supply the energy required by the MVC encoder.
To demonstrate the energy-related challenges of MVC encoding, a case study is
presented in the following.
Figure 3.1 presents the energy consumption to encode a 4-view HD1080p video sequence using the MVC encoder while considering five fabrication technologies. Also presented are the battery draining times for the following state-of-the-art smartphone batteries: Apple iPhone4 (5.25 Wh), Nokia N97 (5.6 Wh), and Samsung Galaxy S3 (7.8 Wh). Note that these batteries are unable to meet the MVC demand. Despite the technology scaling that provides meaningful energy reduction for deep-submicron nodes, the energy consumption remains high considering embedded device constraints. For instance, let us analyze the best-case scenario, where a device is fabricated at a state-of-the-art 22 nm node and features a 7.8 Wh battery, as available in the latest Samsung Galaxy S3 (Samsung 2012) released in Q3 2012.
Fig. 3.1 Total energy consumption [Wh] over time [s] to encode 4-view HD1080p MVC at the 90, 65, 45, 32, and 22 nm nodes, annotated with battery draining times (5.25 Wh: 124.6 s and 346.6 s; 5.6 Wh: 133.1 s and 369.6 s; 7.8 Wh: 189.5 s and 526.9 s)
In a scenario where the MVC encoder is the only task draining
the battery, only 526.9 s (8 min 46.9 s) of recording would be possible before the battery is completely drained. Such battery life is unacceptable and does not meet market and user requirements. Meaningful energy reduction techniques are required to bring the MVC consumption into a feasible energy envelope. For this, a better understanding of the energy consumption sources is required.
Figure 3.2 demonstrates that the motion and disparity estimation task (ME/DE) is responsible for about 90 % of the total energy. These numbers were measured using the Orinoco (Krolikoski 2004) simulation environment and may deviate from actual silicon figures; the simulation is nevertheless valuable for a relative comparison of the energy consumed by the MVC encoding tools. The numbers consider the fast TZ Search algorithm (Tang et al. 2010) as the search pattern. Motion compensation (2.5 %), deblocking filter (2.5 %), and intra-frame prediction (2 %) follow in the energy ranking, each representing no more than 2.5 %. Thus, reducing the ME/DE consumption is of key importance to reach energy efficiency. The ME/DE consumption is directly related to the size of the search window (SW), that is, the region where the search is performed. Increasing the ME/DE search window increases the energy due to the larger number of matching candidates and the larger amount of data (memory accesses) required to perform the task. Figure 3.3 quantifies the energy consumption for five distinct SW sizes. Comparing the corner cases, from a small [±16, ±16] SW to a large [±128, ±128] SW, the energy increases by a factor of 6.5×. From single-view knowledge it is possible to affirm that there is no need for SWs larger than [±64, ±64]. However, disparity vectors tend to have larger magnitudes, and the ME/DE task requires a larger SW to find these matching candidates. According to Xu and He (2008), for good disparity estimation performance in HD1080p video sequences the search window should be at least [±96, ±96].
Fig. 3.3 MVC energy breakdown for multiple search window sizes
Fig. 3.4 MVC energy for distinct mode decision schemes
Fig. 3.5 Energy breakdown between computation (10 %) and memory: off-chip memory access (45 %) and on-chip memory (45 %)
Regarding the mode decision (MD), the exhaustive RDO-MD consumes up to 100× the energy of a single-mode MD, as shown in Fig. 3.4. Obviously, a single-mode MD is not used in practice, as it would result in poor quality and coding efficiency. Nevertheless, this example highlights the optimization space for energy-efficient solutions in the MD control.
From the architectural perspective, computation and memory (external memory access and on-chip memory storage) are the two energy consumption sources. Here, dynamic and static energies are jointly considered. As shown in Fig. 3.5, the energy breakdown is composed of 90 % memory-related energy consumption, while the computation itself represents 10 %. Typically, for a rectangular search window on-chip memory using Level-C data reuse (Chen et al. 2006), the on-chip memory energy and the external memory access energy are evenly distributed, but this may vary according to design options (on-chip memory size, data-reuse scheme, etc.).
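To make these numbers concrete, the following minimal sketch estimates, for the search-window sizes of Fig. 3.3, the on-chip buffer size and the per-MB external-memory traffic under a Level-C-style data reuse, as well as the number of full-search candidates. The 16 × 16 MB size, luma-only data, and 1 byte/pixel are illustrative assumptions rather than values from a specific design.

MB = 16  # macroblock width/height in pixels (assumption: 16x16 luma MBs)

def level_c_stats(sw_h, sw_v):
    # Search window is [-sw_h, +sw_h] x [-sw_v, +sw_v] around the MB.
    ref_w = 2 * sw_h + MB                  # reference region touched by one MB
    ref_h = 2 * sw_v + MB
    onchip_bytes = ref_w * ref_h           # Level-C on-chip buffer (1 B/pixel)
    # Level-C reuses the overlap between horizontally adjacent MBs, so only
    # a 16-pixel-wide vertical stripe is newly fetched per MB.
    new_bytes_per_mb = MB * ref_h
    candidates = (2 * sw_h + 1) * (2 * sw_v + 1)   # full-search positions
    return onchip_bytes, new_bytes_per_mb, candidates

for sr in (16, 32, 64, 96, 128):
    buf, traffic, cand = level_c_stats(sr, sr)
    print("[±%3d]: buffer %6.1f KB, traffic/MB %5.1f KB, %6d candidates"
          % (sr, buf / 1024.0, traffic / 1024.0, cand))

Even under Level-C reuse, both the buffer size and the candidate count grow quadratically with the search range, which is why larger windows are so costly in energy.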
Fig. 3.6 MVC vs. simulcast complexity: normalized coding complexity [×] over the number of views (1–8) for H.264 simulcast and for MVC with 1, 2, and 4 reference views
Figure 3.6 shows that encoding eight views using four reference views is 19× more complex than encoding a single H.264 view. Even with two reference views, as in current multiview systems, the complexity exceeds the H.264 single-view complexity by 14×. To understand what this complexity represents, it is important to consider that real-time H.264 encoding for HD1080p still poses significant challenges in embedded device development and requires application-specific hardware acceleration (see the discussion in Sect. 2.5.4). Moreover, according to Ostermann et al. (2004), the H.264 encoder is about 10× more complex than the MPEG-4 Part 2 encoder. Compared to the simulcast encoding of eight views (8× the H.264 single-view complexity), MVC is 1.75× and 2.37× more complex for two and four reference views, respectively (i.e., 14×/8× and 19×/8×).
The total encoder complexity is mainly concentrated in the motion and disparity estimation (ME/DE) unit, responsible for about 90 % of the total processing, as depicted in Fig. 3.7. The deblocking filter (DF) and motion compensation (MC) are the most complex blocks after ME/DE. The MVC encoder complexity measured from the JMVC (JVT 2009a) reference software without optimizations reaches 2 GIPS (giga-instructions per second) for 4-view real-time MVC encoding at HD1080p resolution alone. This throughput is unfeasible even for high-end desktop computers. For instance, the latest Intel Core i7 3960X (Bennett 2011) processor, with six physical cores running at 3.3 GHz, is able to provide about 180 MIPS. Thus, state-of-the-art high-end processors are orders of magnitude below the performance requirements of real-time MVC encoding if no application/architectural optimizations are performed. The task is even more challenging for embedded processors.
For energy-efficient MVC there is a need to drastically reduce the complexity. Based on the presented observations, the ME/DE and MD modules have the highest potential for complexity reduction and, for this reason, are explored in this work. Deep application knowledge is required to design efficient complexity reduction algorithms able to avoid objective and subjective video quality losses.
The high complexity and memory requirements posed by the MVC encoder are not the only challenges related to its realization. MVC energy consumption is unevenly distributed over time. The processing and memory energy components vary depending upon coding parameters, user definitions, system state, and video content. These run-time variations make the MVC encoder design even more challenging: on the one hand, an under-dimensioned encoder leads to performance issues and does not guarantee reduced energy consumption, due to the need for additional buffering; on the other hand, an over-dimensioned encoder faces underutilization and unnecessary energy consumption.
The MVC prediction structure is a dominant factor in terms of energy variation, since distinct frame types (I, P, or B) present distinct processing and memory access behaviors. I frames are the lightest frames, since the ME/DE (which represents 90 % of the encoder complexity) is completely skipped. P frames employ ME/DE in a single direction, searching a single reference frame. B frames require heavy processing in comparison to I and P frames, and intense memory access, while executing the ME/DE search in multiple reference frames/views. In Fig. 3.9 the frame-level energy consumption for seven GGOPs is presented. Each bar represents the sum of the energies spent to encode the frames from all four views that belong to the same time instant (i.e., S0Tx + S1Tx + S2Tx + S3Tx). GGOP borders (anchors), for the experimented prediction structure, have one I frame, two P frames, and one B frame (which performs only DE, since there are no temporal references available); for more details see the prediction structure in Fig. 2.7. Consequently, GGOP borders drain a reduced amount of energy (1.5 Ws/frame), as shown in Fig. 3.9. All other relative positions within the GGOP are composed only of B frames, and the energy consumption drastically increases in comparison to the GGOP borders. Typically, the center of the GGOP is the most energy-hungry time instant, since the temporal references are distant and a more extensive motion search is required to find a good match. According to the experiment presented in Fig. 3.9, the energy consumption may exceed 7 Ws/time instant in this case. This represents a 4.7× instantaneous energy variation within the same GGOP.
Although prediction-structure-related energy variations may be easily inferred from the coding parameters, there is another important variation source that may not be easily obtained: the video-content-related variations. The video content variations occur at multiple levels: (a) view level: distinct views may present distinct video content such as textures and motion/disparity behavior; (b) frame level: video
Fig. 3.9 Frame-level energy consumption along the video frames (1–60)
Fig. 3.10 ME memory usage over time (µs)
properties vary along the time; (c) MB level: within a frame, distinct regions or objects may present distinct image properties. Figure 3.9 depicts the energy variations along the time. For instance, GGOP #6 (frames 41–49) drains a reduced amount of energy compared to the previous GGOPs due to the reduced coding effort resulting from easier-to-encode video content. The MB-level variations are shown in Fig. 3.10 in terms of ME memory requirements. The memory usage changes along the time depending on the motion intensity of the video content: high-motion regions present increased memory usage in relation to low-motion regions within the same frame. Therefore, energy efficiency in MVC encoding requires understanding the sources of energy variation and designing adaptive architectures able to manage, at run time, the energy consumption while considering dynamically varying parameters (such as video content) and the system state.
The large complexity and intense memory communication related to MVC pose a series of challenges for real-time encoding at high definitions, mainly in the embedded systems domain. Energy consumption represents the most challenging issue related to embedded MVC encoding. Thus, there is a dire need to reduce the energy of the MVC video encoder through complexity and memory access reduction. Energy-efficient solutions must jointly consider optimizations at the algorithmic and architectural levels. Coupling deep application knowledge with the intelligent employment of low-power design techniques is a key enabler for the realization of an energy-efficient embedded MVC encoder.
Based on the discussion presented in Sect. 3.1, energy-efficient MVC requires the following optimizations at the algorithmic level:
• Energy-efficient mode decision scheme: The MVC defines an increased optimization space for the optimal prediction mode selection, leading to high complexity and energy requirements, as demonstrated in Sect. 3.1. An efficient fast mode decision scheme is needed to reduce the optimization space through heuristics able to accurately anticipate the coding mode selection. The neighborhood information and image/video properties may provide hints to completely avoid the evaluation of unlikely prediction modes.
• Energy-efficient motion and disparity estimation: ME/DE is the most complex and energy-hungry module in the entire MVC encoder. Intelligent optimizations in ME/DE lead to meaningful overall energy reduction. Energy-efficient ME/DE may be achieved by applying ME or DE elimination, search direction elimination, motion/disparity vector anticipation, object motion/disparity field analysis, etc.
• Dynamic complexity adaptation: The energy-efficient MD and ME/DE can be designed with distinct strengths in order to handle the energy versus quality trade-off. Additionally, MVC presents a dynamically varying behavior over time depending on coding parameters, user constraints, and video content. An energy-aware complexity adaptation scheme must be able to predict these variations and to react at run time by reducing/increasing the complexity budget through the MD and ME/DE parameters. The dynamic complexity adaptation scheme may also exploit asymmetric coding properties such as the binocular suppression theory (Stelmach and Tam 1999).
The energy-efficient algorithms described above must be designed considering their impact on the architectural implementation. At the architectural level, the energy-efficient solution must employ:
• Low-energy motion and disparity estimation architecture: The ME/DE task requires high throughput but typically allows a high degree of parallelism. To meet the throughput requirements at a reasonable operating frequency while reducing energy, multiple levels of parallelism must be exploited, including (a) pixel-level, (b) MB-level, (c) reference-frame-level, (d) frame-level, and (e) view-level parallelism. This allows operating in a reasonable range of operating frequency and voltage. The processing units should be designed to enable power gating and/or DVS to adapt to the performance variations.
• Energy-efficient on-chip video memory hierarchy: Simply feeding the highly parallel ME/DE processing units while avoiding performance losses is typically a very challenging task. The on-chip video memory additionally has to deal with the high memory-related energy consumption and the memory requirement variations. For that, an accurate memory sizing strategy is required. Also, the on-chip video memory must support partial power gating and/or DVS to adapt to memory requirement variations while minimizing the static energy consumption.
• Data-reuse and prefetching technique: Neighboring MBs tend to repeatedly access the same data from the reference frames during the ME/DE process. To avoid additional external memory accesses, the reference data must be stored in local memory. However, a larger local memory leads to increased static energy. Hence, only the data actually required must be read from external memory and stored locally. The energy-efficient MVC solution requires a memory-friendly data-reuse technique able to reduce external memory accesses without employing a larger local memory. To avoid performance losses due to local memory misses, the required data must be prefetched in time, which demands an accurate memory behavior predictor that understands the ME/DE search pattern. Accurate prefetching becomes even more challenging for state-of-the-art adaptive and customizable search algorithms.
• Dynamic Power Management: Supporting power gating and/or DVS in memory and processing units does not directly lead to energy savings. To reach energy efficiency, an intelligent dynamic power management (DPM) scheme is required to define the proper power states at each given time instant. The DPM must apply deep application knowledge, including offline statistical analysis, neighborhood history, and image/video characteristics, in order to accurately predict the performance and memory requirements and take the proper action (a minimal sketch is given below).
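As a toy illustration of such a DPM decision, the sketch below powers on only as many on-chip memory banks as a naive moving-average predictor of the per-MB memory usage suggests. Bank size, bank count, and the predictor itself are illustrative assumptions; the actual scheme applies offline statistics and 3D-neighborhood history.

from collections import deque

BANK_BYTES = 8 * 1024          # assumed on-chip memory bank size
history = deque(maxlen=8)      # bytes accessed by the most recent MBs

def banks_to_power_on(predicted_bytes, total_banks=16):
    need = -(-int(predicted_bytes) // BANK_BYTES)   # ceiling division
    return min(max(need, 1), total_banks)           # keep at least one bank on

def on_mb_encoded(bytes_accessed):
    history.append(bytes_accessed)
    predicted = sum(history) / len(history)         # moving-average predictor
    return banks_to_power_on(predicted)             # banks for the next MB

for b in (20000, 35000, 90000, 15000):              # example per-MB usage
    print(on_mb_encoded(b), "banks powered on")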
Addressing each challenge related to energy-efficient MVC contributes to the overall energy reduction, and a balanced combination of energy-efficient techniques may lead to a drastic MVC energy reduction. The energy reduction, however, must not be built upon meaningful coding efficiency/video quality losses; otherwise, the use of MVC over simulcast is no longer justified. Video quality issues are discussed in detail in the following section.
In the previous subsections the need for energy-efficient MVC encoding was motivated and justified. To reach such efficiency, complexity reduction, efficient architectures, and efficient memory management techniques including run-time adaptations are required. These techniques, however, may lead to undesirable rate-distortion (RD) performance losses. In other words, the optimization techniques
Fig. 3.11 PSNR [dB] vs. bitrate [BR] and normalized energy vs. QP for simplified mode decision schemes (extended from Fig. 3.4)
may lead to reduced video quality for the same output bitrate. For simplicity, in this section we discuss the impact of the optimization algorithms in terms of video quality variation. However, it is necessary to keep in mind that, for a more general analysis, the rate-distortion performance must be evaluated by jointly considering the objective video quality and the generated bitrate. The RD trade-off can be managed through Quantization Parameter (QP) adaptation by employing an efficient rate control (RC) scheme (see Sect. 2.3.4).
To enable the use of MVC in real-world solutions, its implementation must be energetically feasible, and the resulting video quality (for similar bitrate) must be significantly improved in relation to previous coding standards applying simulcast-based coding. Under this assumption, the energy reduction techniques must aggressively reduce the total energy consumption at the cost of little or no quality loss. Figure 3.11 (extended from Fig. 3.4) depicts the impact of some simplified mode decisions in terms of video quality versus bitrate. Take the example of the "SKIP only" MD, which consumes 1 % of the total coding energy of the exhaustive RDO-MD at the cost of nearly 3 dB quality loss. Clearly, this is not a reasonable solution, due to the high quality loss. According to the experiments presented in Merkle et al. (2009) and Oh et al. (2011), MVC provides about 1 dB quality increase in relation to H.264 simulcast. In case the energy-efficient optimizations lead to a quality drop on the order of 1 dB, there is no reason to use MVC; in this scenario multiple state-of-the-art H.264 encoders should be employed, avoiding the 1.75×–2.37× complexity increase (Sect. 3.1.1) of MVC in relation to simulcast. Intermediate solutions dealing with the relation between energy and video quality are also presented in Fig. 3.11. The same kind of energy vs. quality observation holds for the ME/DE optimizations.
Additionally, 3D video quality involves additional properties in relation to regular 2D videos. Blocking artifacts are severely undesirable in 3D videos and must be avoided during the encoding process. Such artifacts may lead to problems in intermediate viewpoint generation and/or to the stereo-pair mismatch problem, as described in (Stelmach and Tam 1998). A quality drop due to a blurring effect, in contrast, tends to be better tolerated by the human visual system, as indicated by binocular suppression studies (Stelmach and Tam 1999).
Figure 3.12 presents an overview of the contributions of this monograph related to the energy-efficient realization of MVC. The high-level diagram presents the algorithmic and architectural contributions along with the conceptual contribution related to the 3D-Neighborhood correlation. Each contribution is detailed in Chaps. 4 and 5, as indicated in Fig. 3.12.
The energy reduction and management algorithms, hardware architecture design, memory designs, and data-reuse schemes are based on application knowledge to deliver more efficient results. In this monograph we define the 3D-Neighborhood concept, which is widely used to guide its algorithmic and architectural contributions. The 3D-Neighborhood is defined as the MBs belonging to neighboring regions in the spatial, temporal, and view/disparity domains. The analysis of the 3D-Neighborhood space provides powerful information to better understand the MB correlation and to accurately predict the behavior of future MBs, as detailed in Chap. 4.
Fig. 3.12 Overview of the monograph contributions: energy-efficient algorithms (Chap. 4), including the fast ME/DE (Sect. 4.3) and the hierarchical rate control (Sect. 4.4); and energy-efficient architectures (Chap. 5), including the ME/DE hardware architecture (Sect. 5.1), the dynamic search window formation-based data reuse (Sect. 5.3), and the multi-bank on-chip video memory with application-aware power gating (Sect. 5.4)
3.5.1 3D-Neighborhood
The 3D-neighborhood data are also analyzed at run time to perform the actual
predictions related to the fast mode decision, fast ME/DE algorithms, and rate con-
trol. Also, the data reuse and the memory requirements prediction used to control the
power states of the on-chip video memory employ the neighborhood knowledge.
This section presents the main energy-efficient algorithms proposed in this monograph, which are detailed in Chap. 4.
Multilevel Mode Decision-based Complexity Adaptation: We propose a novel dynamic
complexity reduction scheme for non-anchor frames in MVC. Our scheme exploits
different video statistics and the coding mode correlation in the 3D-Neighborhood to
anticipate the more probable prediction modes. It employs a candidate mode-ranking
mechanism reinforced with an RDCost-based neighbor confidence level to determine
the more probable and less probable prediction modes. The proposed scheme also
incorporates an Early SKIP technique that exploits the high occurrence of SKIP MBs
in order to reduce the MVC encoder complexity by considering the 3D-Neighborhood
correlation and image properties. Two complexity reduction levels named Relax and
Aggressive with different threshold equations are defined. These levels provide a
trade-off between energy/complexity reduction and video quality. To limit the propa-
gation of prediction error, the anchor frames are encoded using exhaustive RDO-MD.
In this case the prediction error is propagated less due to the availability of a better
prediction from the anchor frames of the neighboring GOPs.
The energy-aware complexity adaptation scheme for MVC targeting mobile
devices employs several quality-complexity classes (QCCs), such that each class
evaluates a certain set of coding modes (thus a certain complexity and energy
requirement) and provides a certain video quality. It thereby enables a run-time
trade-off between complexity and video quality. To support asymmetric view quality
and exploit the binocular suppression properties, views for one eye are encoded with
high-quality class and views for the other eye are encoded using a low-quality class.
Our complexity adaptation adapts the QCCs for different views at run time depend-
ing upon the current battery level.
Fast Motion and Disparity Estimation: Our fast ME/DE algorithm computes the
confidence of predictors (motion/disparity vectors of the neighboring MBs) in the
3D-Neighborhood to completely skip the search step. The predictors are classified
according to a confidence level and the search pattern is replaced by a reduced num-
ber of candidate vectors (up to 13). To exploit this knowledge, accurate motion and
disparity fields must be available. Therefore, at least one frame using DE and one
using ME must be encoded with a near-optimal searching algorithm. In our scheme,
to avoid a significant quality loss, all anchor frames and the frames situated in the
middle of the GOP are encoded using the TZ Search algorithm (Tang et al. 2010).
Once the motion and disparity fields are established, all remaining frames are
encoded based on predictors available in these fields.
Hierarchical Rate Control: The HRC for MVC employs a joint solution for the
multiple actuation levels of rate control. The proposed HRC employs a Model
Predictive Control-based rate control that jointly considers GOP-phase and frame-
level stimuli to accurately predict the bit allocation and define an optimal control
action at coarse grain. This guarantees smooth bitrate and video quality variations
along time and view domains while supporting any MVC hierarchical prediction
structure. To further optimize the bit allocation within the frames, the HRC imple-
ments a Markov Decision Process to refine the control action at BU level taking into
consideration image properties to define and prioritize Regions of Interest (RoI).
The fine-grained adaptation promotes an increase in objective and subjective video
qualities inside the frame. The target bitrate at each time instant is predicted based
on the bitrate distribution within the 3D-Neighborhood.
Thresholds Definition Methodology: The energy-efficient algorithms, mainly those based on statistical heuristics, are very sensitive to the thresholds. For this reason we consider the threshold definition methodology as part of this work. Our schemes employ QP-based threshold equations in order to guarantee a proper reaction to changing QP values and maintain the energy efficiency. The thresholds for a subset of QPs are derived from an extensive statistical correlation analysis of the 3D-Neighborhood. Probability Density Functions (PDFs) assuming a Gaussian distribution are typically used to model the distribution of the coding properties. The QP-based threshold equations are then modeled and formulated using polynomial curve fitting over the set of statically defined thresholds.
The computational and energy requirements demanded by optimal MVC encoding are orders of magnitude beyond the reality of current embedded systems. As demonstrated along this section, optimal MVC encoding requires up to 1000 BIPS, while current processors deliver about 180 MIPS. In this scenario, state-of-the-art batteries would be able to power the MVC encoder for just a few minutes. Thus, there is a need to reduce the MVC complexity and attack the main sources of energy consumption. As quantified along this section, mode decision and ME/DE represent more than 90 % of the MVC encoder consumption. Moreover, within the ME/DE block, the memory-related energy is dominant over the computation-related energy. Aware of this behavior, a series of energy-oriented contributions is presented.
This monograph presents energy-efficient algorithms and hardware architectures that enable the real-world implementation of the MVC encoder. Among the algorithms are a Multilevel Fast Mode Decision and a Fast ME/DE algorithm. These solutions employ the 3D-Neighborhood correlation to predict the outcome of the full RDO-MD and to avoid unnecessary ME/DE searches. Additionally, an Energy-Aware Complexity Adaptation algorithm is proposed to enable run-time adaptation in the face of varying coding parameters and video inputs. To avoid potential quality losses posed by these heuristic-based algorithms, an HRC is presented.
A motion and disparity estimation architecture is proposed in order to provide
real-time performance and increased energy efficiency to the most complex MVC
encoding block. The Fast ME/DE algorithm is considered along with on-chip mem-
ory design techniques to reduce energy consumption. Moreover, the on-chip mem-
ory is controlled by our Application-Aware Power Gating. The external memory
accesses are reduced by the Dynamic Search Window Formation algorithm.
Chapter 4
Energy-Efficient Algorithms for Multiview Video Coding
The energy consumption in MVC encoding is directly related to the high computational effort and the intense memory access driven by the data processing. Therefore, the energy-efficient algorithms for Multiview Video Coding proposed in this monograph are based on complexity reduction and complexity control techniques. Moreover, in addition to the energy consumption perspective, meaningful complexity reduction is also required from the performance perspective in order to make real-time MVC encoding feasible for real-world embedded devices.
This chapter presents the proposed energy-efficient algorithms targeting complexity reduction for Multiview Video Coding through fast mode decision and fast motion and disparity estimation techniques. An energy-aware complexity adaptation algorithm designed to offer run-time adaptivity to the changing scenarios (battery level, user constraints, video content) of battery-powered embedded devices is further presented. Aware of the rate-distortion losses posed by such complexity reduction techniques, we also present a video-quality management technique to avoid visual degradation. The quality management employs a rate control unit able to maximize the video quality for a given target bitrate while providing smooth quality and bitrate variations in the spatial, temporal, and disparity domains.
The studies of correlation within the 3D-Neighborhood build the foundation for all algorithms proposed in this chapter. These studies contemplate the analysis of coding modes, motion and disparity fields, and bitrate allocation. Additionally, the profiling of the mode distribution and of the motion/disparity vectors is a key enabler for energy-efficient solutions able to provide high complexity reduction at a negligible cost in terms of coding efficiency.
The graph presented in Fig. 4.1 quantifies the mode distribution in anchor and non-anchor frames of the Ballroom and Vassar sequences for various QP values (22–37). In anchor frames the mode distribution follows the typical distribution trend of H.264/AVC-based encoding (Huang et al. 2006), i.e., more Intra-coded MBs at lower QP values and more SKIP and large block partitions of Inter-coded MBs at higher QP values. On the contrary, for non-anchor frames, a major portion of the total MBs (50–70 %) is encoded as SKIP for QP > 22. The percentage of SKIP-coded MBs goes up to 93 % (average 63 %) in Vassar, a well-known test video with low motion intensity. The second dominant mode is Inter-16 × 16. Notice that, for QP > 27, the percentage of Intra-coded MBs in non-anchor frames diminishes to less than 1 %.
The uneven mode distribution shows that there is great potential for complexity reduction in the non-anchor frames if the SKIP or Inter-16 × 16 coding modes are correctly predicted for an MB. In the following analysis, we show that variance/gradient information, in conjunction with the coding mode and RDCost correlation in the 3D-Neighborhood, provides a good prediction of the SKIP and/or Inter-16 × 16 coding modes.
Fig. 4.1 Coding mode distribution [%] for anchor and non-anchor frames over the coding mode types (SKIP, Inter-16 × 16, Inter-16 × 8, Inter-8 × 16, Inter-8 × 8 and below, Intra-4 × 4, Intra-16 × 16)
Fig. 4.2 Intra-, Inter-, and SKIP-coded MBs across views S0–S2 and time instants T0–T8, with an upscaled non-anchor picture of S0T1
The first observation provided by Fig. 4.2 is the distinct mode distribution in the anchor and non-anchor frames. It is noteworthy that the number of SKIP-coded MBs is much higher in the non-anchor frames. This is due to the fact that a larger correlation space is available for non-anchor frames compared to the anchor ones and, consequently, there is a higher likelihood of providing a better prediction employing the SKIP mode.
The upscaled frame (S0T1) of the Ballroom sequence in Fig. 4.2 demonstrates that most of the MBs in the background of the scene (spectators and wall) and parts of the foreground objects (the dancers' suits and the floor) of a non-anchor frame are encoded using the SKIP mode. The MBs at the object borders (dancers) are encoded using temporal/view-prediction modes (i.e., Inter-coded MBs) or spatial-prediction modes (i.e., Intra-coded MBs). Only a few high-textured MBs containing moving spectators in the background are encoded using spatial/temporal/view-prediction modes.
Note in Fig. 4.2 that MBs belonging to the same region tend to use the same coding mode when considering spatially, temporally, or disparity-collocated MBs. For instance, consider frame S0T1: the dancer borders share the same coding mode used by the spatial neighboring MBs that belong to this border. Also, the same coding mode tends to be shared with the temporally and disparity-collocated MBs in frames S0T2 and S1T1, respectively.
However, different neighboring MBs in the 3D-Neighborhood exhibit different amounts of correlation with the current MB. Figure 4.3 shows the coding mode hits (averaged over various QPs and video sequences) using the exhaustive RDO-MD. A coding mode hit corresponds to the case where the optimal coding mode of a neighbor is exactly the same as that of the current MB; otherwise it is counted as a coding mode miss. The coordinates on the x- and y-axes correspond to the MB number in the corresponding column and row of a frame, e.g., (2,4) means the 2nd MB of the 4th row. The eight neighbor frames in the 3D domain are evaluated and named according to the cardinal points presented in Fig. 4.3. There are 44 neighbors in total: 4 spatial, 18 temporal, 18 disparity, and 4 disparity-temporal. Note that the disparity and disparity-temporal neighbors consider the GDV rounded to an integer number of MBs.
Figure 4.3 illustrates that the spatial neighbors in the current frame exhibit the highest coding mode correlation with the current MB (i.e., hits > 70 %), followed by the disparity neighbors in the North and South view frames (i.e., hits > 66 %). The coding mode hits of the disparity neighbors are lower than those of the spatial neighbors due to the variations near the object borders and the inaccuracy of the GDV. The lower number of hits for the temporal and disparity-temporal neighbors is basically due to the motion properties. Overall, for non-anchor frames, in more than 98 % of the cases the optimal coding mode of an MB is present in the 3D-Neighborhood. This means that by testing the coding modes of all 44 neighbors it is highly probable to find the optimal coding mode for the current MB (for more than 98 % of the MBs in the current frame). Moreover, due to the availability of a limited set of optimal coding modes in the non-anchor frames (typically much smaller than the number of modes tested in an exhaustive RDO-MD), a significant complexity reduction may be achieved.
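For concreteness, the sketch below enumerates one possible composition of the 44-MB 3D-Neighborhood (4 spatial + 18 temporal + 18 disparity + 4 disparity-temporal). The counts follow the text; the exact positions assumed for the spatial and disparity-temporal members are illustrative.

def neighborhood_3d(v, t, x, y, gdv):
    # Current MB at column x, row y of frame (view v, time t); gdv is the
    # global disparity vector rounded to an integer number of MBs.
    nbs = []
    # 4 spatial neighbors (already-encoded MBs in the current frame)
    nbs += [(v, t, x - 1, y), (v, t, x, y - 1),
            (v, t, x - 1, y - 1), (v, t, x + 1, y - 1)]
    # 18 temporal neighbors: 3x3 windows in the West (t-1) and East (t+1) frames
    for tt in (t - 1, t + 1):
        nbs += [(v, tt, x + dx, y + dy)
                for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    # 18 disparity neighbors: GDV-compensated 3x3 windows in North/South views
    for vv, s in ((v - 1, 1), (v + 1, -1)):
        nbs += [(vv, t, x + s * gdv + dx, y + dy)
                for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    # 4 disparity-temporal neighbors (GDV-compensated collocated MBs)
    for vv, s in ((v - 1, 1), (v + 1, -1)):
        for tt in (t - 1, t + 1):
            nbs.append((vv, tt, x + s * gdv, y))
    return nbs

assert len(neighborhood_3d(1, 1, 10, 10, 2)) == 44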
As discussed above, there is great potential for finding the optimal encoding mode in the 3D-Neighborhood. However, a large number of different coding modes may exist in this neighborhood. Thus, in order to reduce the number of probable modes, additional information is needed. In this monograph we consider video and image properties and the RDCost as this additional information. The studies related to these properties are presented in the following sections.
Along our studies, multiple video and image properties—including variance, brightness, edges, and gradient—were evaluated in order to provide useful information to build fast mode decision algorithms. Among these properties, variance and gradient proved the most useful; they are computed per MB as defined in Eqs. (4.1) and (4.2).
Fig. 4.3 Coding mode hits in the 3D-neighborhood: percentage hits for the spatial, temporal (West/East), view (North/South, GDV-compensated), and view-temporal neighbors across times T(n−1), T(n), T(n+1) and views S(k−1), S(k), S(k+1). Each bar is an average hit value for three multiview sequences (Ballroom, Vassar, Exit) encoded with the exhaustive RDO-MD using four different QPs (22, 27, 32, 37)
Fig. 4.4 Variance PDFs for the SKIP, Inter-16 × 16, and Intra-16 × 16 coding modes

$$\mathrm{Var}_{MB}=\sum_{i=1}^{256}\left(r_i-r_{AVG}\right)^2,\qquad r_{AVG}=\left(\sum_{i=1}^{256}r_i+128\right)\gg 8 \qquad (4.1)$$

$$D_x=\left(\sum_{i=0}^{15}\sum_{j=0}^{15}\left|\frac{\partial f}{\partial x}\right|+128\right)\Big/\,256,\qquad D_y=\left(\sum_{i=0}^{15}\sum_{j=0}^{15}\left|\frac{\partial f}{\partial y}\right|+128\right)\Big/\,256 \qquad (4.2)$$

$$\frac{\partial f}{\partial x}=r(i,j)-r(i-1,j),\qquad \frac{\partial f}{\partial y}=r(i,j)-r(i,j-1)$$

where $r_i$ (resp. $r(i,j)$) denotes a luma sample of the 16 × 16 MB.
Figure 4.4 shows different PDF (Probability Density Function) plots of the variance for various coding modes. It is noticeable that the peaks for the SKIP and Inter/Intra-16 × 16 modes are at 400 and 700, respectively. Therefore, MBs with low variance are more likely to be encoded as SKIP than as Inter/Intra-16 × 16. On the contrary, MBs with high variance (1,500–2,500) are more likely to be encoded using smaller block partitions. The PDFs for the gradient are omitted; they have a distribution similar to that of the variance.
Since there is a considerable overlap between the PDFs of 16 × 16 and smaller
block partitions, in order to obtain a more robust/accurate prediction about the cod-
ing modes, RDCost (see section “Analyzing the RDCost”) and coding mode corre-
lation in the 3D-Neighborhood are considered along with the variance and gradient
information.
Fig. 4.5 (a) PDF of the RDCost difference (between the current and the neighboring MBs) for SKIP hit and miss; (b, c) surface plots of the RDCost difference for the SKIP coding mode hit and miss; (d) RDCost prediction error for spatial neighbors
In order to determine which neighbor has a probable coding mode hit or miss, we compute the difference between the RDCost of a neighbor and the predicted RDCost of the current MB (as the actual RDCost is not available before the RDO-MD process). In the following we analyze the relationship between this RDCost difference and the coding mode hit/miss discussed above. Figure 4.5a presents the PDFs of the RDCost difference for coding mode hits and misses in the case of the SKIP mode. The PDFs show that a SKIP coding mode can be predicted with a high probability of a hit when the RDCost difference is low. Figure 4.5b, c shows MB-wise surface plots of the RDCost difference (averaged over all frames of the Ballroom sequence) for hits and misses, respectively. These plots demonstrate that most of the hits occur when the RDCost difference is below 10 k, while the number of misses increases when the RDCost difference goes above 70 k. This behavior also conforms to the PDFs in Fig. 4.5a. The analysis shows that the RDCost difference provides a good hint for a hit in the case of a SKIP coding mode. Similar behavior was observed in the hit and miss PDFs of the other coding modes. Here, we discuss the PDFs for the SKIP mode as an example, since it is the dominant coding mode in non-anchor frames, especially for higher QP values.
Fig. 4.6 PDFs of the predicted RDCost for different coding modes (Inter-16 × 8, Inter-8 × 16, Inter-8 × 8 and below, Intra-4 × 4)
Figure 4.6 presents the PDFs of the predicted RDCost for different coding modes. The variable shapes of the PDFs already hint towards the exclusion of improbable modes for a given value of the predicted RDCost. Since a good prediction is important to determine a near-optimal coding mode, we have evaluated the accuracy of the predicted RDCost against the optimal RDCost to analyze the risk of misprediction.
Since the RDCost is not available without the exhaustive RDO-MD, we tested different predictors for the current MB RDCost within the 3D-Neighborhood. After analyzing the mean and median RDCost predictors, we have determined that the median RDCost of the spatial neighbors [see Eq. (4.3)] provides the closest match to the optimal RDCost. In Eq. (4.3), $S_L$, $S_T$, and $S_{TL}$ represent the left, top, and top-left spatial neighbors, respectively:

$$\mathrm{RDCost}_{pred}=\mathrm{median}\left(\mathrm{RDCost}_{S_L},\ \mathrm{RDCost}_{S_T},\ \mathrm{RDCost}_{S_{TL}}\right) \qquad (4.3)$$
Our detailed analysis illustrates that it is possible to accurately predict the optimal coding modes, mainly for non-anchor frames, if the coding mode distribution, video/image properties, and the RDCost correlation within the 3D-Neighborhood are jointly considered.
Fig. 4.7 Average RDCost prediction error for spatial neighbors in the Vassar sequence; mispredictions concentrate at the object borders
Fig. 4.8 Hierarchical prediction structure (views S0–S3, time instants T0–T8, anchor and non-anchor frames) and the neighboring positions used as MV/DV predictors
Fig. 4.9 MV/DV error distribution between predictors and optimal vector (Ballroom, Vassar)
To design the fast ME/DE algorithm, we analyzed the difference between each predictor and the optimal vector (i.e., the MV/DV error) in the 3D-Neighborhood (i.e., spatial, temporal, and view domains). A set composed of one spatial median predictor, six temporal predictors, and six disparity predictors is analyzed. The temporal predictors are selected from the previous and next frames (in display order), called the West and East neighbor frames, respectively. For each neighboring frame, three predictors are calculated: (a) the collocated MB (the MB in the reference frame at the same relative position as the current MB), (b) the median up (using the median formula specified by the MVC standard (JVT 2009b) for the spatial predictor), and (c) the median down (the median of A*, B*, C*, and D*, as shown in Fig. 4.8). The disparity predictors from the North and South neighboring view frames are obtained by considering the GDV.
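A sketch of the resulting predictor set (up to 13 candidate vectors) is given below. The function mv(view, frame, x, y) is a hypothetical accessor returning the already-encoded (mvx, mvy) of an MB; the component-wise median follows the H.264/MVC rule, and the exact neighbor positions of the median-down predictor (A*–D* in Fig. 4.8) are simplified to a three-candidate median here.

def median3(a, b, c):
    # Component-wise median of three (mvx, mvy) vectors
    return tuple(sorted(v)[1] for v in zip(a, b, c))

def predictor_set(mv, view, frame, x, y, gdv):
    preds = []
    # Spatial median predictor from left, top, and top-right neighbors
    preds.append(median3(mv(view, frame, x - 1, y),
                         mv(view, frame, x, y - 1),
                         mv(view, frame, x + 1, y - 1)))
    # Temporal predictors: West (previous) and East (next) frames
    for f in (frame - 1, frame + 1):
        preds.append(mv(view, f, x, y))                  # collocated
        preds.append(median3(mv(view, f, x - 1, y),      # median up
                             mv(view, f, x, y - 1),
                             mv(view, f, x + 1, y - 1)))
        preds.append(median3(mv(view, f, x - 1, y + 1),  # median down
                             mv(view, f, x, y + 1),
                             mv(view, f, x + 1, y + 1)))
    # Disparity predictors: North/South views, GDV-compensated
    for v, s in ((view - 1, 1), (view + 1, -1)):
        gx = x + s * gdv
        preds.append(mv(v, frame, gx, y))
        preds.append(median3(mv(v, frame, gx - 1, y),
                             mv(v, frame, gx, y - 1),
                             mv(v, frame, gx + 1, y - 1)))
        preds.append(median3(mv(v, frame, gx - 1, y + 1),
                             mv(v, frame, gx, y + 1),
                             mv(v, frame, gx + 1, y + 1)))
    return preds   # 1 spatial + 6 temporal + 6 disparity = 13 predictors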
Figure 4.9 illustrates the MV/DV error distribution for the Vassar (low motion) and Ballroom (high motion) test video sequences in the 3D-Neighborhood. Each plot represents the difference between a given predictor (in this case the spatial predictor and three collocated predictors in different neighboring frames) and the optimal vector of the current MB. It shows that, in the majority of the cases, the predictor vectors have values similar to the optimal vector; indeed, most of the predictors have exactly the same value as the optimal vector. Our analysis shows that this observation is valid for the other nine predictors in all directions of the 3D-Neighborhood, as depicted in Fig. 4.9 (only a few error plots are shown here).
To quantify the MV/DV error distribution in the 3D-Neighborhood, several experiments were carried out to measure the frequency with which a given predictor is equal to the optimal vector (i.e., MVPred = MVCurr). When this condition is satisfied, a predictor hit is counted; Table 4.1 summarizes the hit rate and the availability of each predictor.
Table 4.1 Predictors hit rate and availability

Predictor     Neighbor frame   Hit [%]   Available [%]
Spatial       n.a.             94.12     99.90
All           West             96.94     51.52
              East             97.93     60.30
              North            97.94     65.40
              South            98.67     21.29
Collocated    West             58.43     99.90
              East             66.79     99.90
              North            95.39     72.39
              South            96.75     23.48
Median Up     West             54.74     99.90
              East             63.78     99.90
              North            93.17     73.99
              South            94.61     23.98
Median Down   West             54.99     99.89
              East             63.92     99.89
              North            93.21     74.13
              South            94.70     23.93
[Figures: bit allocation [Kbits] at the view level (View 0–View 7), frame level, and BU level (legend ranges 0–20 to 100–120 Kbits)]
The bitrate distribution at frame level presented in Fig. 4.11 shows that, inside each GOP, the frames that present higher bitrates are located at the lower hierarchical prediction levels. This is related to the distance of the temporal references: the farther the reference, the more difficult it is to find a good prediction. Therefore, more residual error is inserted, resulting in higher bitrates. In B-views this effect is attenuated, since these views are less dependent on temporal references due to the higher availability of disparity references. Figure 4.11 also illustrates that, for neighboring GGOPs, the frames at the same relative position exhibit a similar and periodic rate distribution pattern, the GOP-phase.
Inside each frame, the number of bits generated for each BU is also related to the video content. Figure 4.12 shows that the homogeneous and low motion/disparity
background requires a lower bitrate compared to the dancers' region and to the textured floor for a similar quality. However, the Human Visual System (HVS) requires a higher level of detail in textured and border regions to perceive good quality; consequently, these regions deserve higher objective quality. Therefore, textured regions must be detected and receive an increased number of bits during the encoding process through QP reduction.
Summary: The frame-level bitrate distribution depends on the prediction hierarchy and the video content of each frame. Due to the correlation of video content, an effective rate control must consider the neighboring frames in the temporal, disparity, and GOP-phase domains. The video properties have to be considered at the BU level in order to locate and prioritize regions that require higher quality.
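The sketch below illustrates the idea of predicting a frame's bit target from the 3D-Neighborhood, as a weighted combination of the temporal, disparity, and GOP-phase neighbors' actual bits. The weights and the neighbor selection are illustrative assumptions, not the actual HRC formulation.

def predict_frame_bits(bits, v, t, gop_len,
                       w_temporal=0.4, w_disparity=0.3, w_gop_phase=0.3):
    # bits: dict mapping (view, time) -> bits of already-encoded frames
    neighbors = ((w_temporal, bits.get((v, t - 1))),         # temporal
                 (w_disparity, bits.get((v - 1, t))),        # disparity
                 (w_gop_phase, bits.get((v, t - gop_len))))  # same GOP phase
    acc = wsum = 0.0
    for w, b in neighbors:
        if b is not None:                 # skip unavailable neighbors
            acc += w * b
            wsum += w
    return acc / wsum if wsum else None

bits = {(1, 7): 24000, (0, 8): 42000, (1, 0): 30000}
print(predict_frame_bits(bits, v=1, t=8, gop_len=8))   # -> 31200.0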
4.2 Thresholds
$$F\left(\mu+n\sigma;\,\mu,\sigma^{2}\right)-F\left(\mu-n\sigma;\,\mu,\sigma^{2}\right)=\Phi(n)-\Phi(-n) \qquad (4.4)$$
Fig. 4.13 PDF showing the area of high probability (below TH = μ + σ, where μ is the average and σ the standard deviation) as the shaded region
Fig. 4.14 PDFs of the RDCost of SKIP-coded MBs (Vassar) for various QP values
Note that the relation between an image property and the coding property to be inferred from it varies with the quantization parameter (QP). This comes from the fact that the QP influences the decisions taken by the encoder. For example, considering the mode decision process, the QP changes the quantization itself but also changes the λ parameter [see Eq. (2.3)] used for balancing the quality vs. bitrate trade-off. Therefore, the presented statistical study needs to be replicated for different QPs. To avoid the complete data extraction and statistical analysis for every single QP value (which ranges from 0 to 51, while practical use typically goes from 22 to 37), we analyze a set of QPs and derive a generic equation valid for any QP. The generic equation is derived using polynomial curve fitting.
To provide a practical example, we present the threshold derivation for the RDCost property of SKIP MBs encoded using the exhaustive RDO-MD. Figure 4.14 shows the PDFs for the Vassar sequence encoded using various QP values. Notice that the PDF for QP 27 shows a concentrated distribution centered in a relatively low RDCost range, i.e., a small average (μ) and standard deviation (σ). In contrast, the PDFs for relatively high QPs (32–42) exhibit a low peak centered in a relatively high RDCost range.
Recall that the goal is to define thresholds for detecting SKIP MBs with a given confidence, considering a Gaussian distribution for the RDCost. In this problem, the confidence region is defined from zero to a threshold point defined in terms of the average (μ) and standard deviation (σ)—THRDCost = μ + nσ—where n represents a multiplier factor for the standard deviation. A higher n means that more MBs will satisfy this condition. For the example presented in Fig. 4.13, the threshold was defined as THRDCost = μ + σ (n = 1), and all MBs with RDCost < μ + σ belong to the high-confidence zone represented by the gray-filled area in Fig. 4.13. The region of confidence is defined in Eq. (4.6):
$$F\left(\mu+\sigma;\,\mu,\sigma^{2}\right)-F\left(0;\,\mu,\sigma^{2}\right)\approx 0.84 \qquad (4.6)$$
Equation (4.6) shows that up to 84 % of the MBs belong to the high-confidence region; since F(0; μ, σ²) ≈ 0 when μ ≫ σ, the region below μ + σ captures approximately Φ(1) ≈ 0.84 of the distribution. We define these points of high probability as the RDCost thresholds for a set of QPs,
Fig. 4.15 RDCost thresholds for a set of QP values (20–40) and the corresponding polynomial curve fitting
represented as diamond points in Fig. 4.14. The points for the different QPs are used to derive the QP-based threshold equation [Eq. (4.7)] using polynomial curve fitting in order to extend the thresholds to any QP value, as depicted in Fig. 4.15.
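The whole methodology fits in a few lines, sketched below with NumPy: fit μ and σ per training QP, take TH = μ + nσ, and fit a quadratic in QP (the same shape as Eq. (4.17)). The sample generator stands in for measured RDCost data and is purely illustrative.

import numpy as np

def threshold_for_qp(rdcost_samples, n=1.0):
    mu, sigma = np.mean(rdcost_samples), np.std(rdcost_samples)
    return mu + n * sigma            # ~84 % of MBs fall below mu + sigma

qps = np.array([22, 27, 32, 37])
# Placeholder per-QP RDCost sample sets (replace with data from RDO-MD runs)
samples = {qp: np.random.default_rng(qp).gamma(2.0, 900.0 * qp / 22, 5000)
           for qp in qps}
ths = np.array([threshold_for_qp(samples[qp]) for qp in qps])

a, b, c = np.polyfit(qps, ths, deg=2)    # TH(QP) = a*QP^2 + b*QP + c
th_any_qp = lambda qp: a * qp * qp + b * qp + c
print([int(th_any_qp(q)) for q in (22, 27, 32, 37, 40)])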
4.3 Multilevel Mode Decision-based Complexity Adaptation
Firstly, the 3D-Neighborhood information is fetched and the RDCost for the current MB is predicted using the spatial neighbors, considering their high ratio of coding mode hits [Eq. (4.9)]. A list of candidate prediction modes (CandidateList) is formed from the 3D-Neighborhood. Each candidate mode is associated with a rank value (RMODE). This value is calculated as the accumulated confidence level of the neighbors with the same coding mode [CLNBi(Mode); Eqs. (4.10) and (4.11)]. The confidence level of a neighbor is computed by evaluating the normalized difference (NDiff) between its actual RDCost and the RDCost predicted for the current MB [Eq. (4.12)]. Note that the confidence-level calculation depends upon the quality of the RDCost prediction (section "Analyzing the RDCost"). The candidate list is then sorted according to the rank value [Eq. (4.13)].
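The following sketch condenses this ranking into code. The confidence function 1/(1 + NDiff) is an illustrative stand-in for Eqs. (4.10)–(4.12); the spatial median of Eq. (4.3) supplies the predicted RDCost.

from collections import defaultdict
from statistics import median

def rank_candidate_modes(neighbors, rdcost_pred):
    # neighbors: list of (mode, rdcost) pairs from the 3D-Neighborhood
    rank = defaultdict(float)
    for mode, rdcost in neighbors:
        ndiff = abs(rdcost - rdcost_pred) / max(rdcost_pred, 1.0)  # NDiff
        rank[mode] += 1.0 / (1.0 + ndiff)    # closer RDCost -> higher vote
    # Sorted CandidateList: most confident mode first
    return sorted(rank.items(), key=lambda kv: kv[1], reverse=True)

rdcost_pred = median([9800, 10400, 11000])   # left, top, top-left (Eq. 4.3)
neighbors = [("SKIP", 10100), ("SKIP", 9700), ("Inter16x16", 15200),
             ("SKIP", 10900), ("Inter16x8", 31000)]
print(rank_candidate_modes(neighbors, rdcost_pred))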
[Figure: multilevel mode decision flow — 3D-Neighborhood analysis and RDCost-based confidence-level ranking, followed by the evaluation of high-confidence modes, low-confidence modes, video-properties-based mode prediction, and texture/direction-based mode prediction; each phase is followed by an early termination test (RDCost > THET) before the MB is encoded]
Fig. 4.18 Early SKIP threshold curves for (a) RDCost and (b) variance
Based on the analysis of the high SKIP MB occurrence in non-anchor frames (section "Coding Mode Distribution Analysis"), our scheme employs an early SKIP prediction. In case a SKIP mode is correctly predicted, a significant complexity reduction is obtained, as the ME and DE are entirely skipped. The early SKIP mode prediction is only performed if sufficient correlation is available in the 3D-Neighborhood. To avoid a misprediction (which may result in significant PSNR loss), the early SKIP prediction depends upon three conditions considering the mode rank, variance, and RDCost, as presented in Eq. (4.14):

$$\mathrm{EarlySKIP}=\left(R_{SKIP}\ge TH_{R\_ES}\right)\wedge\left(\mathrm{Var}_{MB}<TH_{Var\_ES}\right)\wedge\left(\mathrm{RDCost}_{pred}<TH_{RDCost\_ES}\right) \qquad (4.14)$$
The QP-based thresholds for RDCost (THRDCost_ES) and variance (THVar_ES) were obtained using the corresponding PDF analysis. The area of high probability is considered as the average plus one standard deviation; a threshold is thereby given as TH = μ + σ. The PDFs for four different QP values are used to determine four thresholds at different QPs, and a QP-based threshold formulation is obtained using polynomial curve fitting. Figure 4.18 presents the thresholds (THRDCost_ES and THVar_ES) for four QPs and the corresponding curve fitting. The threshold for ranks (THR_ES) was obtained (using an exhaustive analysis) as 15 % of the total confidence level accumulated over the entire CandidateList (i.e., the sum of the ranks of all modes). The Early SKIP is also discussed in (Zatt et al. 2010).
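Put together, the Early SKIP test reduces to three comparisons per MB, as sketched below. The RDCost threshold reuses the polynomial shape of Eq. (4.17) and the variance threshold is a rough linear fit to Fig. 4.18b; both sets of coefficients are illustrative, not the published ones.

def th_rdcost_es(qp):
    # QP-based RDCost threshold curve (shape as in Eq. (4.17))
    return 29.663 * qp * qp - 1409.1 * qp + 18766.0

def th_var_es(qp):
    # QP-based variance threshold, assumed roughly linear (Fig. 4.18b)
    return 990.0 + 2.5 * qp

def early_skip(rank_skip, total_rank, var_mb, rdcost_pred, qp):
    return (rank_skip >= 0.15 * total_rank      # TH_R_ES: 15 % of total rank
            and var_mb < th_var_es(qp)
            and rdcost_pred < th_rdcost_es(qp))

print(early_skip(rank_skip=2.1, total_rank=6.0,
                 var_mb=400, rdcost_pred=3500, qp=32))   # -> True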
After the early SKIP mode prediction, the tested mode is evaluated for early mode decision termination. If the tested RDCost is larger than the threshold THET,
Fig. 4.19 (a) RD curves for five early-termination threshold values (1,000–20,000) compared to the RDO-MD; (b) corresponding complexity reduction [%] per threshold value (QP = 32)
the mode decision proceeds to the next phase. Otherwise, the mode decision is terminated and the best tested mode is used for encoding the current MB.
The threshold for early mode decision termination controls the achieved complexity reduction and the resulting PSNR loss. An excessively high threshold value provides high complexity reduction at the cost of severe PSNR loss. We have performed an exhaustive analysis to determine these thresholds. Figure 4.19 shows the RD curves for five different test threshold values and their corresponding complexity reductions (bars) for QP = 32. It is noted that THET = 5,000 provides minimal PSNR loss and low complexity reduction, while THET ≥ 10,000 provides high complexity reduction at the cost of considerable PSNR loss (i.e., >0.15 dB). In order to provide a trade-off between the achieved complexity reduction and the resulting quality loss, we propose two complexity reduction levels, or complexity reduction strengths:
• Relax complexity reduction: provides a reasonable complexity reduction while incurring a low PSNR loss.
• Aggressive complexity reduction: provides a high complexity reduction at the cost of a slightly higher PSNR loss (still visually unnoticeable in many cases, as shown in the results chapter).
From an exhaustive analysis of various multiview sequences (encoded using the exhaustive RDO-MD), we obtained the plots and QP-based equations for the Relax (blue) and Aggressive (red) complexity reduction levels (see Fig. 4.20).
This early termination is employed after each phase of our dynamic complexity reduction scheme, as explained in the subsequent sections.
Fig. 4.20 Early termination threshold plots for Relax (blue) and Aggressive (red) complexity reduction
The modes in the sorted CandidateList are partitioned into high-confidence and
low-confidence modes using THHighCL. The threshold THHighCL is determined (using
an exhaustive analysis) as 25 % of the total confidence level accumulated on the
entire CandidateList. First, all of the high-confidence modes (i.e., RMODE ≥ THHighCL)
are evaluated. Afterwards, the condition for early termination is evaluated. If the
condition is not satisfied, all of the low-confidence modes (i.e., RMODE < THHighCL) are
evaluated. If the termination condition is not satisfied after evaluating the low-
confidence modes, the mode decision proceeds to the next phase.
As shown in Fig. 4.1, SKIP and Inter-16 × 16 are the two most frequent modes in the non-anchor frames. In case sufficient correlation is not available in the 3D-Neighborhood, the variance property of the frame is considered to evaluate the SKIP and Inter-16 × 16 coding modes (in case these were not evaluated in the previous phases). The thresholds used in the conditions of this phase are derived using the PDFs presented in section "Analyzing the Video Properties", considering the region of high probability as discussed for Fig. 4.13.
In the last phase, a texture-direction-based prediction is employed to evaluate modes other than SKIP and Inter-16 × 16 (if they were not tested in the previous phases). The direction of the gradient is considered to exclude improbable modes.
Besides the algorithms that perform the fast mode decision, a complexity adaptation algorithm is required to adapt the mode decision at run time according to the changing application scenarios. Targeting MVC encoding systems where battery level, user constraints, and video content may vary widely over time, we propose in this section an energy-aware complexity adaptation for MVC targeting mobile devices. Our algorithm, also presented in (Shafique et al. 2010b), employs several Quality-Complexity Classes (QCCs), such that each class evaluates a certain set of coding modes (and thus has a certain complexity requirement) and provides a certain video quality. To support asymmetric view quality, views for one eye are encoded with a high-quality class and views for the other eye are encoded using a low-quality class. Our algorithm adapts the QCCs for different views at run time depending upon the current battery level.
[Figure: prediction structure with asymmetric view quality — even views (S0, S2) encoded in high quality, odd views (S1, S3) in lower quality; anchor frames at T0 and T8, non-anchor frames in between]
With asymmetric view quality, the viewer sees one high-quality and one low-quality view, resulting in a perception close to the high-quality view. The use of high quality in the even views is explained by the fact that they are used as references for the odd views.
Although in (Stelmach and Tam 1999) the low-quality frames were synthetically blurred for analysis, this knowledge can be extended to a real scenario and applied in techniques to reduce the MVC coding complexity. In our scheme, the odd views are submitted to a more aggressive mode decision, resulting in a lower quality relative to their neighboring views.
The following section presents the energy-aware complexity adaptation algorithm along with the description of the Quality-Complexity Classes (QCCs) and Quality States (QS).
In order to employ the asymmetric view quality and the battery-level sensitivity in our scheme, we define three Quality-Complexity Classes (QCCs).
QCC 1: MBs of QCC 1 are exposed to the most aggressive mode decision of our scheme, including only the SKIP and Inter 16 × 16 modes. Therefore, they have the lowest video quality and the highest complexity reduction.
QCC 2: This class presents intermediate video quality and complexity reduction. The modes of QCC 1 plus Intra 16 × 16, Inter 16 × 8, 8 × 16, and 8 × 8 are evaluated.
QCC 3: The most computationally intensive class and, consequently, the one that provides the best video quality. It includes the coding modes available in QCC 1 and QCC 2 plus small blocks such as Intra 4 × 4 and Inter 8 × 4, 4 × 8, and 4 × 4.
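For illustration, the incremental mode sets of the three classes can be captured in a small table-driven sketch in C++ (a minimal sketch; the enum and function names are illustrative, not JMVC identifiers):

#include <vector>

// Coding modes considered by the multilevel mode decision (subset of H.264/MVC modes).
enum Mode { SKIP, INTER_16x16, INTRA_16x16, INTER_16x8, INTER_8x16,
            INTER_8x8, INTRA_4x4, INTER_8x4, INTER_4x8, INTER_4x4 };

// Each Quality-Complexity Class evaluates the modes of the lower classes plus its own.
static const std::vector<Mode> kQcc1 = { SKIP, INTER_16x16 };
static const std::vector<Mode> kQcc2 = { INTRA_16x16, INTER_16x8, INTER_8x16, INTER_8x8 };
static const std::vector<Mode> kQcc3 = { INTRA_4x4, INTER_8x4, INTER_4x8, INTER_4x4 };

// Returns the full candidate mode list for a macroblock encoded under the given QCC (1..3).
std::vector<Mode> modesForClass(int qcc) {
    std::vector<Mode> modes(kQcc1);
    if (qcc >= 2) modes.insert(modes.end(), kQcc2.begin(), kQcc2.end());
    if (qcc >= 3) modes.insert(modes.end(), kQcc3.begin(), kQcc3.end());
    return modes;
}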
Figure 4.22 presents the high-level diagram of our scheme showing the mode decision flow. The QCCs are related to the three prediction phases, represented by the dashed blocks in Fig. 4.22: QCC 1 is subject to Phase 1; QCC 2 to Phases 1 and 2; and QCC 3 to Phases 1, 2, and 3.
[Fig. 4.22: Multilevel mode decision flow. Phase 1 evaluates SKIP and Inter 16 × 16 (QCC 1); Phase 2 adds Intra 16 × 16 and Inter 16 × 8, 8 × 16, and 8 × 8 (QCC 2); Phase 3 adds Intra 4 × 4 and Inter 8 × 4, 4 × 8, and 4 × 4 (QCC 3). After Phases 1 and 2 the decision terminates early if RDCost < EPTZ_Ph1 or RDCost < EPTZ_Ph2, respectively.]
However, even for QCC 2 and QCC 3, a large share of the MBs is coded as SKIP or Inter 16 × 16, and there is no need to test small block sizes for them. For this reason, the early prediction terminator zone (EPTZ) was defined.
[Figure: PDFs of RDCost_SKIP for QP 27, 32, 37, and 42, marking the average (µRD) and the average plus standard deviation (µRD + σRD) used to derive the termination thresholds.]
[Fig. 4.25: Quality State transition machine. Transitions are evaluated at the last MB of each frame and move between adjacent states only, e.g., downwards when the battery falls below 25 % or 10 % and upwards when it rises above the same thresholds plus the hysteresis H.]
The early mode termination (EMT) thresholds associated with each phase and QCC are defined as:

$$ (EMT_{Ph1})_{QCC2} = (EMT_{Ph2})_{QCC3} = RDCost_{AVG} - 1.5\,RDCost_{SD}, \quad (4.15) $$

$$ (EMT_{Ph1})_{QCC3} = RDCost_{AVG} - 0.5\,RDCost_{SD}, \quad (4.16) $$

$$ (EMT_{Ph1})_{QCC2} = (EMT_{Ph2})_{QCC3} = 29.663\,QP^2 - 1409.1\,QP + 18766. \quad (4.17) $$
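As a sketch of how these thresholds could be evaluated in code (assuming the per-frame RDCost average and standard deviation are gathered by the encoder; function names are illustrative):

#include <cmath>

// Eq. (4.17): QP-adaptive early termination threshold.
double emtFromQp(int qp) {
    return 29.663 * qp * qp - 1409.1 * qp + 18766.0;
}

// Eqs. (4.15)-(4.16): thresholds derived from per-frame RDCost statistics.
// Eq. (4.15) applies to (EMT_Ph1, QCC2) and (EMT_Ph2, QCC3) alike.
double emtStatLoose(double rdAvg, double rdSd) { return rdAvg - 1.5 * rdSd; }
double emtStatTight(double rdAvg, double rdSd) { return rdAvg - 0.5 * rdSd; }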
Associated with the QCCs, our scheme employs four different Quality States (QS). The QSs apply the binocular suppression theory (Stelmach and Tam 1999) through asymmetric view quality and react, at run time, to the changing battery level. As summarized in Table 4.2, QS1 presents the highest quality and encodes all views as QCC3. In turn, QS2 and QS3 use view-quality asymmetry, encoding the odd views at lower quality than the even views. QS4 provides the lowest quality and the highest complexity reduction, encoding all views as QCC1.
The QS control is performed by a state machine that receives an indication of the battery level as input. Figure 4.25 presents the transitions between the four possible states. The quality state only changes to the immediately superior or inferior quality level in order to provide smooth video quality variation. The hysteresis (H) is fixed at 5 % in order to avoid quick oscillations between different states and, consequently, video quality fluctuations. This state machine can be easily adapted to consider other external parameters such as user presets and time constraints.
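A minimal C++ sketch of this control follows, assuming the battery level is reported as a percentage. The 25 % and 10 % boundaries follow Fig. 4.25; the QS1/QS2 boundary is not visible in the figure and is an assumption here:

// Quality States: QS1 (best quality) ... QS4 (highest complexity reduction).
enum QualityState { QS1 = 1, QS2, QS3, QS4 };

// Evaluated once per frame (at the last MB): moves at most one state up or down.
// H is the 5 % hysteresis that prevents oscillation between adjacent states.
QualityState nextState(QualityState qs, double batteryPct) {
    const double H = 5.0;
    switch (qs) {
    case QS1: return (batteryPct < 50.0)     ? QS2 : QS1;  // 50 % boundary assumed
    case QS2: return (batteryPct < 25.0)     ? QS3 :
                     (batteryPct > 50.0 + H) ? QS1 : QS2;
    case QS3: return (batteryPct < 10.0)     ? QS4 :
                     (batteryPct > 25.0 + H) ? QS2 : QS3;
    case QS4: return (batteryPct > 10.0 + H) ? QS3 : QS4;
    }
    return qs;
}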
The detailed results of our multilevel fast mode decision algorithm compared to the RDO-MD solution implemented in the JMVC are presented throughout this section. Table 4.3 presents the results for ΔPSNR, ΔBitrate, and complexity reduction (i.e., time saving, TS). For JMVC using the exhaustive RDO-MD, the results are presented as coding time (column T, in seconds), PSNR (dB), and bitrate (column BR, in kbps). The values for a given QP are obtained by averaging over all eight views. The last row, named Average, presents the average results over all sequences. The experiments were performed for eight views considering the IPB view coding order. For more details on the experimental setup refer to Sect. 6.1.
Figure 4.26 illustrates the PSNR (lines) and time savings (bars) comparison of the Relax and Aggressive levels averaged over all views and QPs for the Ballroom and Exit sequences. It is noted that the difference between the RD curves of Relax and Aggressive is more significant at low bitrates and diminishes at higher bitrates. The time savings of the Aggressive level are significantly higher than those of the Relax level at higher bitrates while introducing only a slight RD difference. The Relax scheme was designed to preserve video quality over the whole QP range and is therefore forced to reduce TS at both high and low QPs, presenting higher TS for intermediate QPs. The Aggressive scheme prioritizes higher TS over the whole QP range.
A view-wise ΔPSNR and time savings comparison of the Relax and Aggressive levels is provided in Fig. 4.27 for the Exit sequence encoded using QP = 32. Odd views—with north and south views available in the neighborhood (i.e., Views 1, 3, 5)—present higher time savings compared to the views with just one (i.e., Views 2/4/6/7) or no available neighboring view (i.e., View 0). Additionally, Views 1, 3, and 5 also present a smaller PSNR loss. This higher complexity reduction and reduced PSNR loss are due to the larger correlation space in the 3D-Neighborhood: more neighboring MBs are available for the prediction and, consequently, a more accurate CandidateList is generated.
The high time savings provided by the multilevel mode decision come from the reduced number of coding modes tested. Figure 4.28 presents the average number of coding modes evaluated per MB for the Relax and Aggressive levels.
Table 4.3 Detailed results for ∆PSNR, ∆Bitrate, and time savings compared to the exhaustive
RDO-MD
Video sequence | QP | JMVC: T [s], PSNR [dB], BR [kbps] | Proposed Relax: TS [%], ΔPSNR [dB], ΔBR [%] | Proposed Aggressive: TS [%], ΔPSNR [dB], ΔBR [%]
Ballroom 22 2,682.53 41.111 3,176.849 54.77 0.005 2.923 59.03 0.002 5.820
27 2,490.47 38.415 1,319.736 61.23 0.039 3.015 70.12 0.038 10.150
32 2,315.22 35.667 654.338 57.04 0.039 0.934 65.71 0.084 3.060
37 2,121.62 32.884 360.162 52.67 0.025 0.453 63.07 0.065 1.500
Exit 22 2,671.07 41.601 2,114.453 60.29 0.006 3.937 67.68 0.045 7.510
27 2,268.79 39.456 652.491 71.36 0.016 2.402 80.08 0.099 12.090
32 2,065.18 37.508 292.673 70.19 0.030 1.101 78.10 0.109 5.910
37 1,900.21 35.293 163.436 67.92 0.043 0.357 78.37 0.123 2.060
Vassar 22 2,963.66 40.743 3,007.434 55.26 0.001 2.837 69.35 0.001 5.470
27 2,519.33 37.828 850.324 77.18 0.021 1.198 81.24 0.001 11.250
32 2,114.88 35.490 259.826 76.59 0.020 −0.028 82.44 0.055 3.450
37 1,827.57 33.294 117.428 74.69 0.008 −0.196 82.23 0.034 3.660
Race1 22 4,908.26 42.340 2,549.767 64.55 0.006 8.349 78.10 0.005 11.550
27 4,631.98 39.422 1,182.855 71.70 0.036 6.514 80.09 0.028 17.800
32 4,269.55 36.501 552.949 70.49 0.036 3.443 78.13 0.064 9.800
37 3,806.22 33.795 294.763 69.05 0.028 1.543 74.92 0.074 6.650
Rena 22 2,238.63 46.555 1,347.801 68.54 −0.205 12.283 67.09 −0.212 22.960
27 1,960.61 43.846 587.514 70.55 −0.215 14.289 70.44 −0.310 36.550
32 1,685.64 40.535 293.333 71.79 0.028 6.971 70.87 −0.220 33.740
37 1,452.22 37.396 163.581 66.58 0.043 3.038 73.03 −0.154 26.880
Akko&Kayo 22 2,644.20 43.53 1,743.05 65.67 −0.056 10.152 66.81 −0.050 14.770
27 2,560.85 40.79 808.48 69.66 −0.015 9.395 74.27 −0.020 21.920
32 2,466.77 37.59 433.65 65.66 0.036 3.130 71.05 0.000 15.280
37 2,320.73 34.45 254.24 59.72 0.023 1.597 70.02 0.020 8.970
Breakdancers 22 5,893.46 41.449 4,899.089 53.81 0.002 5.393 62.39 0.004 7.150
27 4,817.77 39.841 1,454.553 63.14 0.038 7.492 76.10 0.061 12.910
32 4,116.45 38.432 667.955 62.98 0.054 7.861 74.95 0.111 11.700
37 3,487.59 36.629 378.932 59.53 0.094 3.615 76.11 0.237 7.150
Uli 22 4,826.40 40.476 8,152.865 44.35 −0.001 2.024 47.53 0.007 3.545
27 4,638.51 38.591 3,801.339 61.34 0.013 3.245 66.62 0.048 5.488
32 4,326.23 36.238 2,056.013 57.22 0.002 2.115 61.41 0.037 4.280
37 3,945.21 33.554 1,162.239 61.54 0.078 3.676 68.11 0.137 7.150
Poznan_Hall2 22 10,737.88 42.9111 5,149.584 63.48 −0.003 5.649 67.68 0.045 7.510
27 8,911.15 41.693 1,417.539 69.02 0.016 5.124 80.08 0.099 12.090
32 7,778.938 40.303 693.789 65.46 0.020 3.913 78.10 0.109 5.910
37 6,702.69 38.606 420.110 61.56 0.047 1.269 78.37 0.123 2.060
GT_Fly 22 6,334.83 41.247 6,437.935 55.16 0.017 9.547 64.68 0.028 12.975
27 5,168.38 39.705 2,029.539 59.98 0.042 8.832 72.68 0.052 10.685
32 4,511.77 38.280 946.270 62.40 0.044 6.562 67.60 0.079 8.655
37 2,185.58 36.819 610.563 59.74 0.051 3.217 67.00 0.116 9.984
Average 22 3,603.53 42.226 3,373.914 58.59 −0.023 6.310 65.03 −0.013 9.926
27 3,236.04 39.774 1,332.162 67.52 −0.001 6.151 75.17 0.010 15.093
32 2,919.99 37.245 651.342 65.98 0.031 3.600 72.84 0.043 10.178
37 2,607.67 34.661 361.847 63.30 0.044 1.857 73.12 0.077 7.606
AVG 3,091.81 37.183 1,172.794 63.85 0.013 4.479 71.54 0.029 10.701
[Fig. 4.26: RD curves (PSNR [dB] vs. bitrate [Kbps]) and time savings comparison of the Relax and Aggressive levels (Ballroom and Exit sequences)]
Fig. 4.27 View-level time savings and ∆PSNR comparison of Relax and Aggressive levels (Exit
sequence, QP = 32)
[Fig. 4.28: Average number of coding modes evaluated per MB for the Relax and Aggressive levels]
Fig. 4.29 Detailed number of evaluated modes for (a) Relax and (b) Aggressive (Exit Sequence)
Fig. 4.30 Frame-wise PSNR loss comparison of Relax and Aggressive levels (Exit, QP = 32)
Fig. 4.31 Frame-wise time saving comparison of Relax and Aggressive levels (Exit, QP = 32)
These figures contain results for non-anchor frames only, which are the primary focus for complexity reduction in this algorithm.
There are 0, 1, and 2 neighboring views available for View 0, View 1, and View 2, respectively (representing all possible cases). View 2 exhibits a higher ΔPSNR and lower time savings, while View 1 exhibits higher time savings and a lower ΔPSNR for most of the frames when compared to the other plotted views. This confirms the view-level results from Fig. 4.27.
The sudden variations (i.e., valleys) in Fig. 4.31 correspond to the frames in the middle of the GOPs, i.e., frames that have temporal neighbors from the anchor frames. In this case, more intra modes are evaluated (in phases 3 and 4) in addition to the inter modes, leading to a lower complexity reduction. View 1, due to the availability of all view-neighbors, suffers less from such variations.
[Fig. 4.32: Average processing time per MB (µs/MB, logarithmic scale) for Views 0–2]
The overhead of our complexity reduction scheme is already included in the total processing time and time savings. Figure 4.32 compares the average overhead of the computational logic of our scheme with the average processing time of one MB encoded using the different schemes. It is noted that the overhead is 0.15 % of the average MB encoding time using the exhaustive RDO-MD. Figure 4.32 shows that the overhead of our scheme is insignificant compared to its time savings.
This section presented the multilevel fast mode decision algorithm for MVC complexity reduction, which exploits the image properties, the RDCost, and the correlation in the 3D-Neighborhood to provide complexity reduction with insignificant PSNR loss. Our detailed analysis provides the foundation for the proposed scheme. In order to react to changing QP values, QP-based threshold equations are deployed.
For a trade-off between the desired complexity reduction and the resulting quality loss, two different operational levels are proposed for our scheme: the Relax and Aggressive modes. However, to better exploit the complexity reduction vs. RD performance trade-off, a control algorithm able to select the most appropriate complexity reduction level at run time is desirable. In the following section, an energy-aware complexity adaptation based on fast mode decision is proposed.
This section presents the detailed experimental results for each Quality State of the proposed energy-aware complexity adaptation algorithm compared to the RDO-MD. The overall results for the complexity adaptation algorithm are presented in section "Comparing the Energy-Aware Complexity Adaptation to the State-of-the-Art" in Chap. 6. The experiments used the experimental setup described in Sect. 6.1.
Table 4.4 presents the detailed PSNR, bitrate (BR), and time saving (TS) results
of the four Quality States of our scheme compared to the exhaustive mode decision
(RDO-MD). For the QS1 state our scheme provides a TS of up to 77 % with negli-
gible PSNR loss (avg. 0.089 dB). The TS goes up to 87 % for QS4 with an average
PSNR loss of 0.195 dB.
According to the motivational analysis presented in Sect. 3.1.1 and the challenges discussed in Sect. 3.2, the two main sources of complexity and energy consumption in the MVC encoder are the mode decision and the motion and disparity estimation units. In Sect. 4.3, distinct solutions for reducing the complexity and energy of the MD were proposed, and an energy-aware complexity adaptation based on mode decision was presented in order to enable run-time adaptivity to changing system and content scenarios. The target of this section is to present solutions to reduce the complexity and energy consumption associated with the second main complexity source, the ME/DE unit.
This section presents a correlation analysis related to motion and disparity vectors (MV, DV), followed by a fast ME/DE algorithm. Our Fast ME/DE algorithm was designed taking a future hardware implementation into account.
Our Fast ME/DE scheme (Zatt et al. 2011c) is based on the previously presented 3D-Neighborhood analysis. To exploit this correlation, however, the motion and disparity fields must be available. In order to establish these fields, at least one frame using DE and one using ME must be encoded with an optimal or near-optimal search algorithm. In our scheme, to avoid major quality loss, all anchor frames and the frames situated in the middle of the GOP are encoded using the TZ Search algorithm [the fast ME/DE algorithm used in JMVC (JVT 2009a)]. The anchor frames are encoded using DE, while the frames in the middle of a GOP use ME, or ME and DE, according to the view they belong to. These frames encoded with high effort are herein referred to as Key Frames (KF), while the others are the Non-Key Frames (NKF). Once the motion and disparity fields are available, all NKFs can be encoded based on these fields: the complete ME/DE search pattern is skipped for all NKFs, and only the predictors inferred from the 3D-Neighborhood are used.
Figure 4.33 presents the flow diagram of our proposed fast ME/DE scheme based on the 3D-Neighborhood vector correlation. It employs two different prediction classes: Ultra Fast Prediction and Fast Prediction. The scheme is composed of three steps: a frame-level evaluation, an MB-level evaluation, and the MV/DV storage.
[Fig. 4.33: Fast ME/DE flow. Frame-level evaluation: check the temporal and disparity neighbors' availability and calculate the predictors' confidence level. MB-level evaluation: check the spatial neighbors and prediction availability and re-calculate the predictors' confidence level. The resulting MV/DVs are stored for future prediction.]
The predictors' Confidence Level is calculated based on the offline hit values presented in Table 4.1; each predictor is associated with a Confidence Level (hit value). If one predictor has a Confidence Level higher than a threshold (CL_Pred > CL_TH), the current MB is classified to be encoded with the Ultra Fast Prediction. Otherwise, the MB is classified to be encoded with the Fast Prediction. For Ultra Fast Prediction MBs, only three vectors are tested: the predictor with the highest Confidence Level (also referred to as the Common Vector), the Zero vector, and the SKIP vector. The Zero and SKIP vectors are tested because of their high occurrence. Fast Prediction MBs test all available predictors in addition to the Zero and SKIP vectors. It is important to mention that even if all predictors are available and distinct (a worst case that rarely occurs), only 13 predictors are tested.
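A sketch of this classification and candidate selection in C++ follows (the data types, the confidence threshold clTh, and the function name are illustrative assumptions):

#include <algorithm>
#include <vector>

struct Vec { int x, y; };
struct Predictor { Vec mv; double confidence; };  // offline hit rate as confidence

// Returns the candidate vectors to test for the current MB.
std::vector<Vec> selectCandidates(const std::vector<Predictor>& preds,
                                  Vec zeroVec, Vec skipVec, double clTh) {
    auto best = std::max_element(preds.begin(), preds.end(),
        [](const Predictor& a, const Predictor& b) {
            return a.confidence < b.confidence; });
    std::vector<Vec> cands = { zeroVec, skipVec };  // always tested (high occurrence)
    if (best != preds.end() && best->confidence > clTh) {
        cands.push_back(best->mv);        // Ultra Fast: only the Common Vector
    } else {
        for (const auto& p : preds)       // Fast: all available predictors
            cands.push_back(p.mv);
    }
    return cands;
}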
Table 4.5 details the fast ME/DE results for the four evaluated sequences considering three different QPs (22, 32, 42). The TZ Search with a search range of [±64, ±64] is used for comparison, as it is used for the Key Frames and performs 23× faster than the Full Search (not used for the performance comparison) while providing similar rate-distortion results (Yang 2009).
[Fig. 4.34: RD curves (PSNR [dB] vs. bitrate [Mbps]) for Uli and Breakdancers]
[Fig. 4.35: View-level execution time savings (%) for Ballroom, Vassar, Breakdancers, and Uli]
Compared to the TZ Search, our fast ME/DE provides 83 % execution time savings at the cost of an 11 % bitrate increase and a 0.114 dB PSNR loss. In the best case, the execution time savings reach 86 %, which represents a significant complexity reduction.
Moreover, the reduced number of candidate blocks leads to a lower number of
external memory accesses.
Figure 4.34 presents the RD curves for the two XGA (1024 × 768) video sequences Uli and Breakdancers. It is noted that the RD curves of our fast ME/DE algorithm are close to those of the Full Search (used for quality comparison only). Compared to the Full Search, our fast ME/DE algorithm suffers an insignificant loss of 0.116 dB (on average).
The view-level execution time savings are presented in Fig. 4.35. Note that for all views (except View 0 of the Vassar sequence) the time saving is ≥80 %. The execution time savings for the high-motion sequences are slightly higher than those for the low-motion sequences. Figure 4.36 presents the average number of SAD operations for the ME/DE of one MB using Full Search, TZ Search, and our fast ME/DE algorithm. On average, the proposed scheme reduces the number of SAD operations by more than 99.9 % in comparison to Full Search and by 86 % in comparison to TZ Search. Note that the detailed results in Figs. 4.35 and 4.36 are for QP 32.
[Fig. 4.36: Average number of SAD operations per MB (logarithmic scale) for Full Search, TZ Search, and our fast ME/DE]
4.5 Video-Quality Management for Energy-Efficient Algorithms
Although the energy and complexity reduction algorithms presented in this chapter were carefully designed to reduce undesirable effects on the coding efficiency, they lead to some level of quality drop due to the heuristics and simplifications inserted in the encoding process. For this reason, an algorithm to manage the coding process and compensate eventual video-quality losses is required. This management algorithm, however, must also consider and optimize the video quality vs. bitrate trade-off in order to increase the coding efficiency while respecting bandwidth constraints, as discussed in the following.
Despite the high coding efficiency provided by MVC, the transmission and storage of 3D videos remain a big challenge, especially for services operating over bandwidth/buffer-constrained infrastructures. It becomes even more challenging due to changing input video properties and run-time variations of the video encoder state, battery level, and user preferences. Thus, to provide high video quality while meeting channel bandwidth/buffering constraints, it is necessary to further optimize the bandwidth usage by intelligently regulating the bit allocation. Therefore, a rate control algorithm is implemented to dynamically find a good compromise between coding efficiency and video quality by adapting the QP.
This section presents the Hierarchical Rate Control (HRC) (Vizzotto et al. 2012) for MVC, which employs a coupled Model Predictive Control-based frame-level RC and a Markov Decision Process-based BU-level RC. Before presenting the HRC, however, a bitrate allocation study within the 3D-Neighborhood is detailed.
The proposed HRC for MVC is depicted in Fig. 4.37.
[Fig. 4.37: HRC overview. A Model Predictive Control-based frame-level RC (rate model QP = f(Tbr), optimizer, observer, and history manager, Sect. 4.5.2.1) defines the frame QP, refined by a Markov Decision Process-based BU-level RC (Sect. 4.5.3.2) into QP_BU/QP_FR for the multiview video encoder, under constraints such as channel bandwidth, frame rate, and GOP length.]
The HRC is responsible for controlling the encoder output bitrate, in accordance with the user preferences and/or channel limitations, by monitoring the MVC encoder and actuating through QP adaptation. It can be conceptually divided into two actuation levels: (1) the frame level (which encapsulates GOP and frame levels) at coarse grain and (2) the basic unit level at fine grain. The MVC encoder receives the video sequences as input along with all user preferences and configurations to start the encoding process. The Model Predictive Control-based frame-level RC models the system behavior considering the encoding hierarchy and predicts the bitrate allocation at frame level considering temporal, view, and GOP-phase (inter-GOP) correlation. It defines the optimal QP for the predicted frames, the base QP, and forwards it to the Markov Decision Process-based basic-unit-level RC. At the BU level, a fine-grained decision is taken to define the QP variation considering the image properties in terms of regions of interest. The fine-grained adaptation promotes an increase in objective and subjective video quality inside the frame by allocating more bits to the RoI (in our case, the hard-to-predict regions; see section "Region of Interest" in Chap. 2 and Sect. 4.1.3). The decision maker considers the previous knowledge, by implementing the Reinforcement Learning (RL) method, to increase or decrease the QP in relation to the base QP. To couple the frame and BU levels in the HRC, the RL unit feeds back to both the MPC and the MDP to keep system consistency and avoid mismatches. The HRC employs an observer unit able to read, store, and manage the MVC encoder feedback (generated bitrate) and the variables that define the encoder system state (target bitrate, QP, input constraints, etc.) in order to support the bitrate prediction and the action/decision making. Also, an image-properties extractor is employed to build the RoI map used by the BU-level RC. This integration allows the HRC to properly exploit the influence of the spatial, temporal, view, and GOP-phase inputs to define a globally optimal control action.
The frame-level MVC rate control problem matches the control-theory superposition principle (Tatjewski 2010), defined as the response at a given place and time of a linear system caused by multiple stimuli. Model Predictive Control (MPC) techniques (García Carlos et al. 1989; Morari and Lee 1999) have been demonstrated to accurately predict the response of multiple-stimuli dynamic systems, such as the MVC encoder, while incorporating the phase concept (periodic behavior) present in GGOP-level RC (see Sect. 4.1.3). MPC outperforms traditional feedback controllers by efficiently integrating input stimuli with state-space constraints while providing flexibility by employing rolling input and output horizons (see section "Model Predictive Control" in Chap. 2).
As discussed in section "Model Predictive Control" in Chap. 2, the main goal of a Model Predictive Controller is to predict the future behavior of a system state and/or output over a finite time horizon as well as to compute the future input signals at each step. These actions occur by minimizing a cost function under inequality constraints on the manipulated control or the controlled variables. In this work, the MPC operates at frame level, predicting the bitrate and providing the QP for each frame to be encoded. The rate controller tries to define a sequence of actions that induces the system to a desired state while the negative effects of these actions are reduced, respecting restrictions and taking constraints into account. In other words, the RC defines a QP that optimizes the bandwidth or bit allocation while maximizing the visual quality and reducing sudden bitrate/quality variations.
The bitrate prediction is performed considering the neighborhood correlation at
temporal, view, and inter-GOP domains. As discussed in Sect. 4.1.3, there is a high
correlation in the temporal and view neighboring frames inside the same GOP.
Moreover, there is also a periodic pattern that repeats at GOP level, the GOP-Phase.
Our MPC-based RC is able to exploit this correlation in order to accurately predict
Fig. 4.38 MPC-based RC horizons: the prediction horizon (current frame) and the control horizons (temporal and view/phase frames) over the MVC prediction structure
[Fig. 4.39: MPC optimization process: the Rate Model produces the target bitrate (TBR) from the bit allocations (BA) and weights (ω); the QP is defined under the constraints (channel bandwidth, frame rate, GOP length) and the Observer feeds back the generated bitrate (GBR).]
the future bitrate. Figure 4.38 represents the previously encoded frames used for prediction (the control horizon) and the current frame to be predicted (the prediction horizon) for a given MVC prediction structure. Our method employs a variable weighting factor for the frames considering their positions in relation to the current frame. The variable weighting factor is calculated considering the number of references and their distance to the current frame. With this extension, our frame-level RC may be directly implemented in any hierarchical bi-prediction (HBP) structure while still capturing the GOP-phase correlation.
Figure 4.39 shows in detail the MPC optimization process and how the component functions interact with each other. Based on the neighborhood correlation, the Rate Model generates a bitrate prediction for the current frame, the target bitrate. Based on this prediction, an optimal QP is defined and the internal model is updated. The system feedback and the actually used QP, defined by the BU-level RC, are received through the observer.
The MPC-based rate control defines the target bitrate (T_BR(f)) considering the bandwidth (BW) and frame rate (FR) constraints along with the neighboring frames' weights (w) and bit allocations (BA), as shown in Eq. (4.19):

$$ T_{BR}(f) = \frac{BW}{FR} \pm w(BA). \quad (4.19) $$
The feedback and the correlation between frames vary with the type of each frame: the bitrates of the distinct frame types (I, P, and B) lie in different ranges; see Fig. 4.11. Thus, the weighting factors for each frame type must be different. The I-frame weight (w_I) is statically predefined (Li et al. 2003), while the P- and B-frame weights (w_P and w_B) are calculated dynamically considering the weights of the temporal neighboring frames. Equation (4.20) shows how the weights are calculated considering the HBP in order to respect the local linearity inside the current GGOP, where w̄_GOP is the average of w in the current GOP, f represents the fth frame of a given type (I, P, or B) in processing order, u = 1/(L_GOP − 1), and L_GOP denotes the GOP length. For a smooth weighting propagation, w is limited according to a statistically defined range:
$$ w_I = 0.75 $$
$$ w_P = \max\left\{ w_{f-1} - 2u,\ \min\left\{ \bar{w}_{GOP} - 0.25,\ w_I - 2u \right\} \right\} \quad (4.20) $$

$$ BA(f) = \left( BA(f-1) - \frac{BA(f-1)}{N_A - 1} + \frac{w^{m,n}_{i,j}}{\sum_0^m \sum_0^n w_{m,n}} - 1 \right) \times \frac{BW}{FR} \times L_{GOP}, \quad (4.21) $$

$$ w^{m,n}_{i,j} = \frac{\left( BR^{m,n}_{i,j} \times QP^{m,n}_{i,j}(f-1) \right) + \left( L_{GOP} - 2 \right) \omega^{m,n}_{i,j}(f-1)}{L_{GOP} - 1} \quad (4.22) $$
Once the prediction is performed, the RC must define a proper action in terms of QP. The QP is determined from the summation of all target bitrates (T_BR(f)) in the prediction horizon, the summation of all generated bitrate in the control horizon (BR), and the history of QPs (H_QP), as shown in Eq. (4.23). Note that the QP defined by the frame-level RC (QP_FL) is not directly used by the MVC encoder but is forwarded to the BU-level RC to refine the QP selection:
$$ QP_{FL} = H_{QP} \times \frac{\sum_{i=1}^{p} T_{BR}}{\sum_{i=1}^{m} BR}. \quad (4.23) $$

$$ Q_k = \min\left\{ Q_{k-1} + 2,\ \max\left\{ Q_{k-1} - 2,\ Q \right\} \right\} \quad (4.24) $$

$$ Q_{k-1} = \min\left\{ QP_{max},\ \max\left\{ QP_{min},\ Q_{CLP} \right\} \right\}. \quad (4.25) $$
In the following, the detailed results of the frame-level-only HRC are presented. Table 4.7 presents the bitrate results generated using SMRC (Single-View Mode Rate Control), extrapolated from the H.264 reference software (JM) using the quadratic MAD prediction (Li et al. 2003). To measure the target bitrate accuracy, we use the Mean Bit Estimation Error (MBEE) metric presented in Eq. (4.26). On average, the proposed frame-level RC provides 1.13 % (up to 1.58 %) of bitrate error while the SMRC provides 2.46 % (up to 2.91 %). The results show that the frame-level HRC predicts the bitrate behavior more accurately and is able to adapt the QP in order to reduce the output error:
$$ MBEE = \left\{ \sum_{i=0}^{GOP_{size} \times N_v} \frac{\left| R_t - R_a \right|}{R_t} \times 100 \right\} \bigg/ N_{Fr}. \quad (4.26) $$
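A direct sketch of Eq. (4.26), taking the per-frame estimation error magnitude (rt: target bits, ra: actually generated bits; names are illustrative):

#include <cmath>
#include <vector>

// Eq. (4.26): Mean Bit Estimation Error over all encoded frames.
double mbee(const std::vector<double>& rt, const std::vector<double>& ra, int nFr) {
    double err = 0.0;
    for (size_t i = 0; i < rt.size(); ++i)
        err += std::fabs(rt[i] - ra[i]) / rt[i] * 100.0;  // percent error per frame
    return err / nFr;
}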
[Figure: BU-level rate control. An RL core combines the RoI variance map (M_S), the coefficient matrix M(δ), the rewards R_S, the transition probabilities, and the utility U(s,s′) to refine the frame-level QP (Q_FL) into QP_BU for the multiview video encoder; the MPC frame-level RC and the observer close the loop via the generated bitrate (GBR).]
Finally, to dynamically adjust the matrix of states for the next decision, the RL is responsible for feeding the system response back to the current state for both the BU-level and frame-level controls.
$$ M_S(i,j) = \frac{\left( BU_i - \mu \right)^2}{N - 1}. \quad (4.27) $$
The HRC implements the BU-level RC by employing a Markov Decision Process. The MDP works over a matrix of independent states M_f(s) representing the QPs of each BU within a frame. Each BU has a set of possible actions A with associated rewards R_S and transition probabilities f(s,δ). In our model, the possible actions are to increase, decrease, or maintain the QP value defined at frame level, as shown in Eqs. (4.31) and (4.32). A matrix of coefficients M(δ) is used to define the reward for each action according to Eq. (4.28). The rewards R_S are calculated based on the RoI map M_S, the matrix of coefficients M(δ), and the reinforcement learning RL (see section "Reinforcement Learning" in Chap. 2), as shown in Eq. (4.29). For each action there is a transition probability f(s,δ) defined by Eq. (4.30):
$$ M(\delta) = \sum \frac{QP_{BU} \times BS}{QP_{max} \times \left( T_{BR} / N_{BU} \right)}, \quad (4.28) $$
$$ R_S = RL \times M(\delta) - M_S, \quad (4.29) $$

$$ f(s,\delta) = PR \mp \Delta\delta, \quad (4.30) $$

$$ QP_{BU} = \begin{cases} QP_{FR} + 1 & \forall\ f(s,\delta) > +1 \\ QP_{FR} - 1 & \forall\ f(s,\delta) < -1 \\ QP_{FR} & \forall\ -1 < f(s,\delta) < +1, \end{cases} \quad (4.31) $$

$$ QP_{FR} = \operatorname{trunc}\left( \frac{\sum M_f(s)}{N_{BU}} \right). \quad (4.32) $$

$$ HR = \frac{\Delta T_{BR} \times \sum QP_{BU}^{k_L}}{\sum G_{BR}^{k_L} \times \Delta QP_{FL}}, \quad (4.33) $$

$$ U(s,s') = \begin{cases} QP_{FL}\, M_f(s,s') & \forall\ f(s,\delta) < -1 \text{ or } f(s,\delta) > +1 \\ M_f(s,s) & \forall\ -1 < f(s,\delta) < +1 \end{cases} \quad (4.34) $$
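The BU-level action of Eq. (4.31) reduces to a three-way decision around the frame-level QP; a minimal sketch (function name illustrative):

// Eq. (4.31): adjust the BU QP around the frame-level QP according to the
// transition value f(s, delta) produced by the MDP.
int buLevelQp(int qpFr, double fsd) {
    if (fsd > 1.0)  return qpFr + 1;  // higher QP: spend fewer bits on this BU
    if (fsd < -1.0) return qpFr - 1;  // lower QP: spend more bits (e.g., RoI)
    return qpFr;                      // keep the frame-level QP
}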
This section presents the detailed results of the proposed HRC compared to the baseline solution (i.e., the JMVC without RC) and to the SMRC (Single-View Mode Rate Control). The comparison with the state of the art is presented in Chap. 6. Table 4.9 presents the accuracy in terms of MBEE (less is better) for our HRC compared to the baseline RC solutions. The test conditions are detailed in Sect. 6.1. On average, our Hierarchical Rate Control provides a 1.6 % MBEE decrease, ranging from 0.7 % to 1.37 %. The superior accuracy is a result of the ability to adapt the QP jointly at frame and BU levels considering the neighborhood correlation and the video content properties.
Table 4.10 presents the objective rate distortion in BD-PSNR (Bjøntegaard Delta PSNR) and BD-BR (Bjøntegaard Delta Bitrate) (Tan et al. 2005) in relation to the JMVC without RC. The HRC provides a 1.86 dB BD-PSNR increase.
4.6 Summary of Energy-Efficient Algorithms for Multiview Video Coding
It enables reading only the required data from external memory, avoiding performance drawbacks. Additionally, the mentioned memory access prediction is forwarded to the on-chip memory run-time management in order to adapt the power states of each memory sector, resulting in reduced energy consumption. The features of this architecture are summarized below.
Based on the in-depth memory access correlation analysis (see Sect. 5.3.1), it was possible to conclude that the memory access prediction can be further improved in relation to the solution presented in Sect. 5.4.2. As a result, novel memory power-gating control schemes may be proposed. In this section, we present a ME/DE hardware architecture featuring an application-aware power-gating scheme able to consider the 3D-Neighborhood correlation and reduce the memory-related energy consumption. The memory hierarchy and power management are carefully explained throughout this section. The main features implemented in this architectural solution are summarized below.
Hardware Architecture with Multibank On-Chip Memory: A hardware architecture with parallel SAD modules is proposed. A pipelined schedule is proposed to enable high throughput. Moreover, the hardware is equipped with a multibank on-chip memory to provide the high throughput needed to meet high-definition requirements. The size and the organization of the memory are obtained by an analysis of the fast ME/DE scheme. Each bank is partitioned into multiple sectors, such that each sector can be individually power-gated to save leakage. The control of the power gates is obtained from the application layer.
An On-Chip Multibanked Video Memory: Based on the offline memory usage analysis, an algorithm is proposed to determine the size of the on-chip memory by evaluating the trade-off between leakage reduction and misses (resulting from a reduced-size memory). Afterwards, the organization (banks, sectors) is obtained by considering the throughput constraint. Each bank is partitioned into multiple sectors to enable fine-grained power management control. The data for each prediction direction is stored in distinct sectors.
Dynamically Expanding Search Window Formation Algorithm: Instead of prefetch-
ing the complete rectangular search window, a selected partial window is formed
and prefetched for each search stage of a given fast ME/DE scheme depending upon
the search trajectory, i.e., the search window is dynamically expanded depending
upon the outcome of each search stage. An analysis is performed to highlight the
issues related to the expansion of the partial window at each search stage. The
search trajectories of the neighboring MBs and their spatial and temporal properties
(variance, SAD, motion, and disparity vectors) are considered to predict at run time
the form of the search window for the current MB. This results in a significantly
reduced energy for off-chip memory accesses.
Application-Aware Power-Gating Scheme for the On-Chip Memory: Depending upon
the fast ME/DE scheme and the macroblock properties, the amount of required data
is predicted. Only the sectors to store the required data are kept powered on and the
remaining sectors are power-gated. A power-gating scheme is employed. Depending
upon the previous macroblock (i.e., using the knowledge from the application that
determines a prediction direction), different sectors can be completely power-gated.
The architectural template overview is presented in Fig. 5.1. It is composed of five main blocks: (a) Energy/Complexity-aware Control Units; (b) Programmable Search Pattern Control Unit; (c) Address Generation Unit (AGU); (d) On-Chip Video Memory; and (e) SAD Calculator. The Energy/Complexity-aware Control Units box is not detailed in this section since this block represents the implementation of all the energy/complexity-aware control techniques presented in Sects. 5.1.1, 5.1.2, 5.1.3, and 5.1.4. Our ME/DE architectures communicate with the remaining MVC encoder blocks by providing SAD values and motion/disparity vectors to the mode decision unit. The reference frame data is read from the external memory that stores the decoded picture buffer (DPB). The MVC encoder writes the DPB after the encoded frames are reconstructed and filtered by the deblocking filter.
The proposed Programmable Search Pattern Control Unit was designed employing a microprogrammed style in order to facilitate the implementation of and experimentation with multiple search patterns. It communicates with the Energy/Complexity-aware Control Units in order to provide search pattern information such as the search pattern used, the memory regions accessed, and the number of candidate blocks tested. The Energy/Complexity-aware Control Units feed the Programmable Search Pattern Control Unit back with the energy/complexity budget, the search pattern to be employed for future MBs, vector predictors, search directions to be exploited, active on-chip memory sectors, etc. This communication, and the hardware actually implemented inside the Energy/Complexity-aware Control Units, depend on which energy-efficient techniques are designed for the specific architectural solution.
Once the search pattern is defined, the candidate blocks are forwarded to the
AGU as a set of points inside the search window. The AGU is responsible for trans-
lating these points into a sequence of actual memory addresses. As the on-chip
video memory is implemented in a cache fashion, the cache tags are generated using the address provided by the AGU according to the tag format defined in Sect. 5.1.3. The on-chip video memory is implemented using SRAM to locally store the samples belonging to the search window. The samples are brought from the external memory in block-based read operations. To check whether the samples required by the search control are available on-chip, the above-mentioned cache tags are tested employing a fully associative approach.
The sum of absolute differences (SAD) Calculator is composed of an array of 4-sample SAD Processing Elements (PEs), an array of adder trees, and comparator trees. The number of PEs depends on the required throughput. The PE connectivity depends on the block sizes supported by the architecture and on the number of candidate blocks processed in parallel. The SAD Calculator is fed in parallel by the on-chip video memory. The number of PEs and the memory width must be jointly defined in order to maximize the hardware usage and processing throughput.
In the following sections, the ME/DE hardware modules are presented in detail.
All the data processing is performed in the SAD Calculator unit. It receives the
current MB samples, which are stored in a small local buffer (omitted in Fig. 5.1),
and the reference samples to determine the SAD between the original block and the
reference block according to Eq. (5.1):
$$ SAD = \sum_{i=1}^{n} \left| Orig(i) - Ref(i) \right|. \quad (5.1) $$
Each Processing Element, as depicted in Fig. 5.2, calculates the SAD for four samples in parallel. The PEs are composed of four subtractors, one absolute operator, and three adders. Although the hardware description supports multiple sample bit-depths, the implementation was limited to 8-bit sample inputs. The PEs are associated using adder trees to generate the SAD for a whole N × N block. In the example presented in Fig. 5.2, the SAD Calculator is designed to process a 4 × 4 block in parallel by associating four PEs (PE0…PE3 process one 4 × 4 block). In this scenario, each adder tree requires three further adders in two logic levels. The larger the block to be processed, the bigger the adder tree: for 16 × 16 blocks, 63 adders are required in six logic levels. Therefore, pipelining is required for bigger block sizes in order to avoid a reduction of the operating frequency. For simplicity, Fig. 5.2 omits the pipeline barriers.
After the SADs are calculated for the multiple blocks processed in parallel, the SAD Comparators Tree is used to select the smallest SAD values. Along with the SAD value, the SAD Calculator feeds back to the Programmable Search Pattern Control the position where the smallest SAD was found. This information is used to
[Fig. 5.2: SAD Calculator: each PE processes four Orig/Ref sample pairs through subtractors, an absolute operator, and adders; the PE outputs are combined by adder trees and comparator trees to produce the block SADs.]
decide the following steps of the search process. The SAD Comparators Tree size
and logic depth depend on the number of blocks processed in parallel.
Hardware implementations of the search control unit are typically limited to one single search pattern. In the architecture proposed in this monograph, we implement a Programmable Search Control Unit able to support multiple search patterns without hardware redesign by employing the microprogramming concept. By simply reprogramming the Search Pattern Memory (SPM) it is possible to change the search pattern (or shape). This allows fast design and verification of hardware ME/DE algorithms.
The Programmable Search Control Unit is composed of a finite state machine
(FSM) and an SPM, both presented in Fig. 5.3. Firstly, the FSM identifies the cur-
rent MB position and reads, from the SPM, the first search pattern. By adding the
current MB position and the coordinates of each search point defined in the SPM,
Fig. 5.3 Programmable Search Control Unit (a) FSM and (b) program memory
the Programmable Search Control Unit determines and dispatches, in parallel, all the search points for that specific search pattern step. Depending on the feedback from the SAD Calculator, the next search step is selected among three options: repeat the same pattern, use another pattern described in the memory, or process the next MB.
The SPM organization is presented in Fig. 5.3. The left table shows the description of each line while the right table specifies the fields and the bit-depth of each field (an x-bit field is represented as <xb>). A 32-bit program memory is used. The first SPM line carries the total number of programmed patterns and the ID of the first pattern (FirstPatternID) to be used, where each search pattern has a unique, sequentially defined ID. A search pattern description starts with a line containing a 16-bit ID (of which only the eight LSBs are actually considered) and the number of points belonging to that specific search pattern. In the following lines, each point is described by an (X, Y) coordinate pair and the NextPatternID. X and Y are 12-bit integers representing the displacement between the search point and the search pattern central reference point. The central reference is initially defined as the MB position and evolves along the algorithm iterations, assuming the best SAD point as the new center. The 12-bit coordinates enable a search range of up to [±2048, ±2048] in relation to the search pattern central reference. The NextPatternID specifies the next pattern to be used in case this point presents the lowest SAD among all points of the current pattern. In case the search point is a terminating point (if it has the lowest SAD, the search ends), the NextPatternID is set to the reserved value 0xFF.
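The packing of one search-point entry into the 32-bit SPM word can be sketched as follows (the field widths follow the description above; the exact bit order inside the word is an assumption):

#include <cstdint>

// One search-point entry of the Search Pattern Memory:
// X<12b> | Y<12b> | NextPatternID<8b> packed into a 32-bit word.
uint32_t packSpmEntry(int x, int y, uint8_t nextPatternId) {
    return ((uint32_t)(x & 0xFFF) << 20) |   // signed 12-bit displacement in X
           ((uint32_t)(y & 0xFFF) << 8)  |   // signed 12-bit displacement in Y
           nextPatternId;                    // 0xFF marks a terminating point
}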
Table 5.1 provides a simple example using a two-step Log Search (Marlow et al. 1997) with window size W = 16 (search range [±8, ±8]) that finishes with a local Cross-Search (Ghanbari 1990) refinement. The first Log Search step (ID 0x0000) leads to the second Log Search step (ID 0x0001), except for the central position (line 2), which leads to the Cross-Search refinement (ID 0x0002). After the second Log Search step, all points lead to the Cross-Search refinement (ID 0x0002). The terminating Cross-Search step points to the reserved terminating pattern ID 0xFF.
Although it is possible to easily extend the Programmable Search Control Unit, the current implementation requires modifications in the FSM to support features such as early termination and threshold adaptation.
The on-chip video memory used in this monograph works as a cache memory com-
posed of an address comparator block and the on-chip SRAM memory itself, as
represented in Fig. 5.4. The address requested by the Search Control and forwarded
by the AGU (still using video representation) is compared to the Tags of each
[Fig. 5.4: On-chip video memory: a tag comparator checks the requested address against the on-chip SRAM entries; on a miss, the data request is forwarded to the external memory.]
memory entry. Each entry represents a full MB of the reference frame, and the tag comparison is performed in parallel. If the reference is available on-chip, the requested data is transmitted to the SAD Calculator unit. Otherwise, the read request is sent to the external memory. The addresses are provided to the AGU, which translates them from the video representation to a burst of addresses mapped to the actual memory address space (see Sect. 5.1.4). After the data is updated, the samples are sent to the SAD Calculator.
The tag of the on-chip video memory is defined as shown in Fig. 5.5, where <nb> represents a value n bits wide. The tag is composed of a unique view identifier, the six LSBs of the Picture Order Counter (POC) within that specific view, and the X and Y coordinates of the reference MB. Using this tag, it is possible to support up to 16 views, access reference frames within a 64-frame temporal window, and support video resolutions up to 2k × 4k. This definition, however, can be easily extended by increasing the bit-depth of each field in order to handle increased demands.
Remember that this is just a template description; in the following sections we present, in detail, the SRAM organization, sizing, and energy management of the on-chip video memory for different scenarios.
The Address Generation Unit (AGU) is used to convert the addresses defined in the video representation, provided by the Search Control, to a linear memory representation.
[Figure: Mapping from the video representation (view v, frame f, position PosX/PosY, size SizeX/SizeY) to the linear memory representation.]
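A sketch of the translation performed by the AGU, assuming frames are stored view-by-view, frame-by-frame in raster order (the actual burst generation depends on the external memory interface):

#include <cstdint>

// Translates a (view, frame, x, y) sample position in video representation
// into a linear external-memory address (one byte per 8-bit sample assumed).
uint64_t aguAddress(unsigned view, unsigned frame, unsigned x, unsigned y,
                    unsigned width, unsigned height, unsigned framesBuffered) {
    uint64_t frameSize = (uint64_t)width * height;
    uint64_t base = ((uint64_t)view * framesBuffered + frame) * frameSize;
    return base + (uint64_t)y * width + x;  // raster order inside the frame
}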
Although significant complexity reduction was achieved through the fast ME/DE algorithm presented in Sect. 4.4, the feasibility of real-time motion and disparity estimation on mobile devices depends on energy-efficient dedicated hardware architectures. The dedicated hardware must exploit the parallelism available in the MVC prediction structure and feature an optimized scheduling scheme in order to optimize hardware usage and energy consumption. Therefore, a deep understanding of the distinct parallelism levels and of the search algorithms' behavior is required. The four levels of parallelism available in the MVC encoder are discussed in Sect. 5.2.1, and possible scheduling schemes are proposed and detailed in Sect. 5.2.2.
Due to the prediction structure used in the MVC as depicted by the arrows in
Fig. 5.7, four levels of parallelism can be exploited to achieve high throughput. For
an easy understanding, the frames in Fig. 5.7 are ordered according to the coding
sequence using numbers for the key frames (KF) and the alphabet order for nonkey
frames (NKF). The I frames are not processed by ME/DE and are considered avail-
able. Frames 2′, 4′, and 6′ belong to the previous GOP.
View-Level Parallelism: Although MVC defines the Time-First decoding order (i.e., all frames inside a GOP of a given view are processed and then the next view is processed), this order is not mandatory (i.e., not forced by the standard) during the encoding process, as long as the bitstream respects it. For instance, views S1 and S3 can be encoded completely in parallel after the S0 and S2 reference views are available.
Frame-Level Parallelism: Within a view there are frames with no dependencies
between them. For example, using one reference frame per prediction direction
(1 west, 1 east, 1 north, and 1 south) frames A and B can be processed in parallel.
Analogously, it is possible to process the frames C, D, E, and F in parallel.
[Fig. 5.7: MVC prediction structure with the key frames numbered (1–7) and the nonkey frames lettered (A–X) in coding order across views S0–S3; frames 2′, 4′, and 6′ belong to the previous GOP.]
[Fig. 5.8: MB-level ME/DE pipeline: the partial search window prefetch overlaps the SAD computations of consecutive search stages; (a) a data refetch causes a one-stage stall, and (b) an early termination wastes the prefetched data of the next stage.]
In this section, we present two scheduling strategies implemented in our architectural solution. First, the scheduling designed for any search pattern algorithm is described in Sect. 5.2.2.1. Then, a scheduling scheme specific to our Fast ME/DE algorithm is presented in Sect. 5.2.2.2. The scheduling for our Fast ME/DE considers two logical search control units, one for the KFs (using TZ Search) and another for the NKFs (using the fast search). The hardware implementation, however, can be simplified and implemented as a single control unit, as presented in Sect. 5.5.
Figure 5.8 presents the MB-level ME/DE processing pipeline scheduling for the selected search algorithm, including the data prefetching and SAD computation for the different search stages. During the SAD computations of the preceding search stage, the partial search window data is prefetched for the succeeding search stage. However, in case of a Search Map miss (see Sect. 5.3.2), where the data needs to be
[Fig. 5.9: GOP-level schedule of the TZ module: the KFs 1–7 are processed per prediction direction (W/E/N/S), overlapping with the next GOP.]
fetched again, a stall of one candidate-data prefetch occurs (see Fig. 5.8a). In case the search algorithm stops due to an early termination criterion, the prefetched search window data is wasted (see Fig. 5.8b). This scheduling scheme was employed in our hardware architecture and evaluated for the TZ Search and Log Search algorithms in order to assess the performance while considering the Search Map prediction hits/misses.
A novel scheduling scheme (Zatt et al. 2011c) designed for the Fast ME/DE search algorithm (Sect. 4.4) is proposed and depicted in Fig. 5.9. The numbers and letters are consistent with Fig. 5.7, and the letters in parentheses represent the prediction directions E (East), W (West), N (North), and S (South). The dotted boxes represent frames that belong to the next GOP. This scheduling assumes the existence of two logic control units operating in parallel: one processing the TZ Search for the KFs and the other processing the fast ME/DE for the NKFs. However, it can easily be mapped to a serial architecture or extended to a more parallel hardware to further exploit the multiple levels of parallelism inherent to MVC.
Each time slot of the TZ Module is dedicated to performing the search for a complete KF in one reference frame. It is noticeable that the coding time for a given GOP is the time to perform 16 TZ searches. This represents a reduction of 81 % in the number of complete TZ searches compared to a system without our fast ME/DE search (which performs 88 complete TZ searches).
For the NKF encoding there is a Fast ME/DE module. After the required reference frames are processed by the TZ module (solving the data dependencies), all NKFs in the same view are processed following the predefined coding order (as shown by the alphabetic order). The data dependencies between the KFs and NKFs are represented in Fig. 5.9 by dashed arrows. To avoid pipeline stalls, the GOP-level pipeline starts the TZ search on the KFs of the next GOP while the fast ME/DE module concludes the current GOP processing. Since the Fast ME/DE represents less than 1 % of the processing effort of the TZ Search, the fast ME/DE module processes the NKFs in a burst and is then clock-gated (CG) until the next usage. For simplicity, Fig. 5.9 does not detail which prediction direction is tested during the encoding of the NKFs. However, in the slot of a given frame A, all required prediction
Fig. 5.10 MB-level pipeline schedule for (a) the TZ module and the fast ME/DE module in (b) Ultra Fast and (c) Fast operation modes
directions are serially tested (in the specific case of frame A, West and East
directions).
The internal scheduling of the TZ Module operates at MB level, as presented in Fig. 5.10a. The three main tasks are the algorithm control, which is always active; the TZ search window prefetch logic, which can be clock-gated after bringing the search windows from the external reference memory (DPB); and the TZ SAD computation, which starts processing as soon as the first useful reference block is available.
As the Fast ME/DE scheme has two prediction modes (the Fast and Ultra Fast prediction modes), two distinct pipeline schedules are required for the Fast ME/DE Module. The Ultra Fast scheduling is presented in Fig. 5.10b and the Fast scheduling in Fig. 5.10c. Three tasks are considered: Fast ME/DE vector calculation, data prefetching, and SAD calculation. First, the Zero Vector is tested because it has no data dependencies with the spatial neighbors. Afterwards, the predictors are evaluated and the Common Vector (if it exists; for algorithm details see Sect. 4.4) or
Predictor Vector 1 is processed (represented by the gray blocks in Fig. 5.10b, c). This is the second vector evaluation step (MB-Level Evaluation in Fig. 4.33), since these vectors can be calculated based on the Frame-Level Evaluation (Sect. 4.4.1). If a spatial vector points to a different position, the additional data is then fetched and processed (Predictor Vector N). The last vector to be tested is the SKIP predictor, which depends upon the previous MB. In this pipeline stage, the MV/DV information of the previous MB must already be available to avoid pipeline stalls. The MB time borders (indicated by the vertical dashed lines in Fig. 5.10b, c) have the same interfaces for both prediction modes, allowing the mode exchange (Fast↔Ultra Fast) without pipeline stalls.
Real encoding systems do not implement the exhaustive search (Full Search, FS) but fast search algorithms. Fast ME/DE search algorithms are usually based on multiple search iterations following a given geometric shape and may employ start point selection and early-stop criteria to reduce the computational effort. These algorithms can provide significant speedup and a reduced number of memory accesses at the cost of negligible quality loss. However, real systems may suffer from the resulting irregular access pattern to the external memory.
As a case study, we present the memory access patterns of two fast ME/DE algorithms, TZ Search and Log Search, considering low- and high-motion MBs (see Fig. 5.11). These search algorithms are implemented in the MVC reference software (JMVC), and their behavior represents a wide family of search algorithms. During ME, high-motion areas require a higher number of memory accesses compared to low-motion areas, in which the search converges quickly. Analogous behavior occurs for DE, where objects with high disparity require more effort to find a good match. Figure 5.12 presents the memory access profile for one frame. The flat regions represent the low-motion/disparity areas, while the peaks are located at the high-motion/disparity ones. Another important observation is that, for the same image region or object, the number of memory accesses and the search pattern
behavior is similar, i.e., neighbor MBs that belong to the same object tend to have
similar memory and search pattern behavior.
Even considering the high-motion/disparity regions, it is noticeable that a large part of the search window is never accessed, resulting in wasted communication and storage energy. On average, the ME/DE accesses 19.85 % of the total search window using TZ Search and 1.01 % using Log Search. This means that most of the search window is read and stored in vain, as detailed in Fig. 5.13. The search pattern is also of key importance in the accuracy vs. memory access trade-off. Compared to the Log Search, the TZ Search requires more memory accesses (see Fig. 5.13), reaches an extended search area (see Fig. 5.11), and tends to provide more accurate search results.
[Fig. 5.13: Search window memory accesses for the Ballroom, Exit, Vassar, and Flamenco sequences]
[Fig. 5.14: Memory usage (KBytes) over time (ms) for ME and DE (Ballroom), showing the low-variation Group-1 and the high-variation Group-2 MBs.]
When further analyzing the memory requirements within a frame (see Fig. 5.14 for the Ballroom sequence), two different variation zones are noticed in ME, corresponding to two different groups of MBs, where the MBs in a group have similar spatial and temporal properties. MBs in group-1 exhibit a low variation in their memory usage, while MBs in group-2 exhibit a high variation. The distinction between the two groups can be made by evaluating the average spatial and temporal properties of the MBs. Depending upon the group-level variations, a low-leakage or high-leakage sleep mode may be selected. The large variations for DE are primarily due to the larger search performed by the TZ algorithm for capturing longer disparity vectors.
Summarizing, an application-aware power management scheme for an on-chip
video memory needs to consider the knowledge of ME/DE algorithm, spatial and
temporal video properties (at both frame and MB levels), and correlation in the
3D-Neighborhood to determine the number of idle sectors and an appropriate sleep
mode for each sector.
Figure 5.15 presents the Search Map for two neighboring MBs (denoted as MBx and
MBx+1) using the Log Search algorithm. After the ME search is performed for the
MBx, a Search Map is built based on the search trajectory (i.e., the ID of the selected
candidate search points at each search stage of the ME/DE scheme). As shown in
Fig. 5.15a, the first search stage selects the candidate with ID 6 as the best candi-
date. Similarly, candidates with ID 3 (at stage 2), ID 4 (at stage 3), and ID 4 (at stage
4) are selected as the best candidates at their respective search stages. This provides
a Search Map of [6,3,4,4] (the trajectory is shown by the red arrows). Note, for each
search stage there is an entry in the Search Map with the ID of the candidate with
minimum SAD at that particular search stage.
Considering the analysis of the MB neighborhood, a Search Map for the MBx+1
can be predicted from the Search Map of MBx. In case there is a deviation in the
search trajectory of these two MBs, there will be a miss in the on-chip memory due
to the prefetch of the wrong region (see the green box in Fig. 5.15b). Typically, these
misses are at the boundaries of the moving objects and occur in relatively few MBs
along the whole video frame. In case of a miss, there will be a stall only for the
prefetching of the first candidate data on the new trajectory (i.e., 16 × 16 pixel data).
All other candidates on the new trajectory will be then prefetched correctly (before
their respective SAD computations, thus not causing any stall) as the search pattern
design of the fast ME/DE schemes is fixed at design time (see the brown box for the
new prefetched data). Typically a miss in the trajectory depends upon the motion/
disparity difference of two MBs, which is significantly smaller in most of the neigh-
boring MBs due to high correlation between them.
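In code terms, the Search Map is simply a per-stage record of the winning candidate IDs. A minimal sketch (illustrative types only, not the actual implementation) of the map and of the hit/miss check between a predicted and an actual trajectory:

#include <cstddef>
#include <vector>

// One winning candidate ID per search stage, e.g. {6, 3, 4, 4} for MBx.
using SearchMap = std::vector<int>;

// Returns the first stage at which the actual trajectory deviates from the
// predicted one; a return value equal to predicted.size() means a full hit.
std::size_t firstMissStage(const SearchMap& predicted, const SearchMap& actual) {
    std::size_t s = 0;
    while (s < predicted.size() && s < actual.size() && predicted[s] == actual[s])
        ++s;
    return s;
}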
Figure 5.16 depicts the pseudo-code for the algorithm of the dynamic search win-
dow formation and expansion. The algorithm works in two major steps. First it
Fig. 5.16 Algorithm for Search Map prediction and dynamic formation of the Search window
predicts the Search Map from the spatial predictors (lines 3–21). Afterwards, it
checks if the search pattern matches the Search Map, prefetches the appropriate
partial search window, and updates the Search Map (lines 23–33).
Four spatial predictors presenting high correlation with the current MB are
used to predict the Search Map (line 4). Afterwards, the variance of these predictors is
computed and the motion and disparity information is obtained (lines 5–6). Based on
the spatial, temporal, and view information, a matching function is computed that
provides a hint on whether the predictors may belong to the same object or may exhibit
similar motion/disparity properties (lines 7, 9–12). Afterwards, the predictors are sorted
with regard to their similarity to the current MB (line 14). The closest predictor is
determined by computing the SAD with the current MB (line 15). In case the closest
predictor also belongs to the same object or exhibits similar motion/disparity, its
Search Map is considered as the predicted Search Map (line 17). Otherwise, the
closest map is found in the remaining predictor set (line 19). If none of the predic-
tors exhibit similarity to the current MB, then the predicted Search Map is empty.
After the Search Map is predicted, it is used to form the search window. For each
search stage, the partial search window is determined according to the Search Map
and prefetched. In case the search candidates of the search pattern are present in the
Search Map (i.e., the search trajectory falls in the predicted region), the partial
search window is simply constructed according to the predicted Search Map and the
prefetched data is used (i.e., a case of hit) (see line 28). Otherwise, if the Search
Map is empty or does not contain the search candidate, the Search Map is ignored
for this stage onwards (see line 26). In this case the prefetched data is wasted and it
is considered as a miss. The partial search window is then constructed according to
the search pattern for the miss parts (see line 31, it can also be seen in the example
of Fig. 5.15b).
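A high-level C++ sketch of these two steps is given below. The helper functions (prefetchPartialWindow, fetchAccordingToSearchPattern, evaluateStageCandidates) are assumed hooks into the memory and search subsystems, and the predictor selection is simplified with respect to the matching function of Fig. 5.16:

#include <cstddef>
#include <optional>
#include <vector>

using SearchMap = std::vector<int>;

void prefetchPartialWindow(int candidateId);     // assumed: prefetch predicted partial window
void fetchAccordingToSearchPattern(int stage);   // assumed: regular fetch for the miss parts
int  evaluateStageCandidates(int stage);         // assumed: SAD evaluation, returns winner ID

// Step 1: take the Search Map of the most similar spatial predictor, if any is
// similar enough; an empty result means no predictor is trusted.
std::optional<SearchMap> predictSearchMap(const std::vector<SearchMap>& predictors,
                                          const std::vector<double>& similarity,
                                          double simThreshold) {
    std::optional<SearchMap> bestMap;
    double bestSim = simThreshold;
    for (std::size_t i = 0; i < predictors.size(); ++i)
        if (similarity[i] > bestSim) { bestSim = similarity[i]; bestMap = predictors[i]; }
    return bestMap;
}

// Step 2: follow the predicted map while the search trajectory confirms it;
// after the first deviation the map is ignored for the remaining stages.
void runSearch(const std::optional<SearchMap>& predicted, int numStages) {
    bool followMap = predicted.has_value();
    for (int s = 0; s < numStages; ++s) {
        if (followMap && s < static_cast<int>(predicted->size()))
            prefetchPartialWindow((*predicted)[s]);  // hit path: data already on-chip
        else
            fetchAccordingToSearchPattern(s);        // miss path: stall for the first candidate only
        const int winner = evaluateStageCandidates(s);
        if (followMap && (s >= static_cast<int>(predicted->size()) || winner != (*predicted)[s]))
            followMap = false;                       // deviation detected: drop the map
    }
}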
In the following section we discuss the architecture of our joint ME/DE scheme
along with the design of the multibank on-chip memory and application-aware
power gating.
On-chip storage of rectangular search windows incurs increased area and leakage
of the on-chip memory, as in the designs presented in (Chen et al. 2006, 2007a, b; Ding et al.
2010; Saponara and Fanucci 2004; Tsung et al. 2009; Tsai et al. 2007). The size of
the dynamically formed search window is significantly smaller compared to rectangular
search windows. This scenario becomes even more critical in MVC, where
ME and DE are performed for multiple views using larger search windows (for
instance [±96, ±96] to capture high disparity regions in DE). Depending upon the
MB properties, the sizes of dynamically expanding search windows may vary
significantly. However, the size of the on-chip memory that stores this window must be
fixed at design time. Therefore, we first perform a design space exploration to
obtain a reasonable size for the on-chip memory (one that provides leakage power
reduction and area savings). In case the MB exhibits low motion and the size of the
prefetched window is smaller than the on-chip memory, the remaining parts of the
memory are power-gated to further reduce leakage.
Figure 5.17 demonstrates the design space exploration for the memory access
distribution using the Ballroom video sequence (a fast-motion sequence). Figure 5.17a
shows the number of MBs for which ME and DE require fewer than 96 MBs. Here, an
MB-based memory fetching is considered. Please note that the reduced number is
mainly due to the adaptive nature of fast ME/DE schemes and it does not mean that
this is within a smaller search range.
Fig. 5.17 Analyzing the memory requirements for ME/DE of different MBs in the Ballroom sequence: (a) histogram and (b) accumulated percentage of MBs in the current frame vs. the number of MBs accessed in the search window (low- and high-motion regions; more than 90 % hits; a [±96, ±96] search window totals 144 MBs)
A rectangular search window of [±96, ±96]
size requires 37 KB of on-chip memory. Figure 5.17b shows that even for such a
large search range, at most 96 candidates are evaluated per MB. This corresponds to
an on-chip memory of 24 KB, i.e., a 35 % area reduction. When scrutinizing
Fig. 5.17b, it is noticed that in more than 95 % of the cases a storage of only 64 MBs is
required (i.e., 16 KB → 57 % savings). We have performed such an analysis for various
video sequences with diverse motion (not shown here due to space reasons).
Similar observations were made in all cases. Therefore, we have selected an
on-chip memory of 16 KB, which provides a significant leakage reduction in the
on-chip memory. In rare cases, where ME and DE may require more MBs, misses
may happen (as we will show in Sect. 5.1). The on-chip memory is organized in 16
banks, where one 16-pixel row of an MB is stored in each bank, in order to
guarantee high parallel throughput.
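The implied address mapping can be sketched as follows (illustrative arithmetic only, assuming the 16-bank organization above and the per-bank sectors introduced below):

// One 16-pixel MB row per bank: the 16 rows of an MB land in 16 different
// banks and can therefore be accessed in parallel.
inline int bankOf(int mbRow) { return mbRow % 16; }

// Sector index of a byte address inside a bank (eight sectors per bank here).
inline int sectorOf(int byteAddr, int sectorBytes) { return byteAddr / sectorBytes; }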
As discussed above, even the 16 KB memory may not be completely used to store
the dynamically expanding search window, as the size of the prefetched search window
highly depends upon the MB properties and the fast ME/DE scheme (it can be seen
in Fig. 5.17b that in more than 20 % of the cases storage for 32 MBs is used, i.e.,
only half of the memory). Therefore, each bank is partitioned into multiple sectors
(eight sectors in this case) where each sector can be individually power-gated to
further reduce the leakage (see Fig. 5.18). The main challenge here is to incorporate
the application and video knowledge to determine the power-gate control, such that
the power-gating signals may be determined depending upon the predicted memory
requirements of the current MB.
Fig. 5.18 On-chip memory organization with application-aware power gating: each memory line is divided into sectors, and groups of sectors are gated together
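A sketch of such an application-aware gating decision is given below (C++). The eight-sector granularity matches Fig. 5.18 and the group-based choice between sleep modes follows the observation of Fig. 5.14; the interfaces and threshold logic are illustrative assumptions:

#include <array>

enum class SectorMode { Active, SleepLowLeakage /*deep sleep*/, SleepHighLeakage /*light sleep*/ };

// Keep only the sectors needed for the predicted window active; put the rest
// into a sleep mode chosen from the MB group: group-2 (high variation) MBs may
// need sectors back soon, so the cheap-wake-up (high-leakage) mode is used,
// while group-1 (low variation) MBs allow the deep (low-leakage) mode.
std::array<SectorMode, 8> planSectors(int predictedWindowBytes, int sectorBytes,
                                      bool highVariationGroup) {
    std::array<SectorMode, 8> plan{};  // value-initialized: all sectors Active
    const int needed = (predictedWindowBytes + sectorBytes - 1) / sectorBytes;  // ceil
    for (int s = needed; s < 8; ++s)
        plan[s] = highVariationGroup ? SectorMode::SleepHighLeakage
                                     : SectorMode::SleepLowLeakage;
    return plan;
}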
For the detailed experimental results presented in this section, a set of video sequences
with four views each was used. The search algorithms used were TZ Search (Yang
2009) and Log Search (JVT 2009a), considering three QP values (22, 27, 32) and
search in the four possible prediction directions with a search window of [±96, ±96]. The
threshold set used was N = 6, a = 1, b = 500, and THSAD = 400.
Figure 5.19 presents details of the Search Map and on-chip memory evaluation.
It shows that the accuracy of the Search Map prediction is higher for low-motion
sequences (e.g., Vassar) than for high-motion sequences (e.g., Flamenco2), as the
search trajectory is shorter and easier to predict (due to a higher number of
stationary/slow-moving MBs). However, even in the worst case the hits are around
80 % (see Fig. 5.19a). In the case of off-chip memory accesses, the misses are higher
for low-motion sequences because the search trajectory tends to converge to the
center (only the central region of the search window is accessed), reducing the
accessed area overlapping with the neighboring MBs. The higher number of memory
misses for low-motion sequences, however, does not limit the off-chip energy savings
achieved for these sequences. The reason is that the percentage of misses is calculated
over a much smaller number of total memory accesses for low-motion sequences.
Figure 5.20 shows the hardware architecture with our proposed dynamic search
window formation scheme. It employs the above-discussed dynamically expanding
search window together with the Search Map prediction and the application-aware
power gating.
Fig. 5.19 (a) Search Map prediction accuracy [%] and (b) off-chip memory misses [%] for the Ballroom, Exit, Vassar, Flamenco, Bk Dancers, and Poznan Hall2 sequences
Fig. 5.20 ME/DE hardware architecture: a control unit with MB property extraction and Search Map prediction drives the dynamic search window formation and prefetching from the external reference frame memory, the power gating control, the AGU, and the ME/DE search control
5.6 Summary of Energy-Efficient Algorithms for Multiview Video Coding
Our architectural solution to enable real-time ME/DE is presented throughout this
chapter. Initially, the architectural template and the basic hardware building blocks are
described in Sect. 5.1. Based on this structure, a pipelined architecture able to
implement both regular search patterns and the proposed Fast ME/DE algorithm is
described in detail.
Targeting the reduction of the energy consumption related to external memory
accesses and on-chip video memory leakage, the Dynamic Search Window
Formation strategy is proposed in Sect. 5.3. This solution observes the search patterns
of the neighboring MBs in order to anticipate the data required for the current
MB. It allows accurate external memory data prefetching while reducing the on-chip
memory size by avoiding the storage of the entire search window.
Targeting further energy reduction, an application-aware power-gating scheme is
integrated into the ME/DE architecture. Assuming an on-chip memory with sector-level
power-gating capabilities, the application-aware power gating considers the
memory usage characteristics of future MBs along with the wake-up cost to
define the sectors for power gating. By doing so, this architecture is able to
significantly reduce the overall energy consumption by minimizing the on-chip memory
leakage.
Chapter 6
Results and Comparison
In this chapter the overall results of this work and the comparison with the latest
state-of-the-art approaches are presented. Before moving to the actual comparison,
the experimental setup is described, discussing the fairness of the comparison in
relation to the related works. The benchmark video properties, common
test conditions, simulation environment, and synthesis tool chain are also introduced
in this chapter. The results for the energy-efficient algorithms are discussed in
terms of complexity reduction while considering the coding efficiency and video
quality in relation to state-of-the-art and optimal solutions. The video quality control
algorithm based on rate control is compared to other rate control techniques
described in the current literature. The energy-efficient architectures are evaluated
against the latest hardware solutions for ME/DE on MVC with emphasis on the
overall energy consumption for both memory access and processing datapath.
Additionally, throughput and IC footprint area are discussed.
This section describes the simulation, design, and synthesis environment
employed during the development of this work. Afterwards, the test conditions and
benchmark video sequences are discussed, followed by the fairness of comparison
with the state-of-the-art approaches. The hardware design method and synthesis
tool chain are also presented in this section.
Each algorithm proposed along this monograph was implemented and evaluated
using the reference software platform provided by the Joint Video Team
(JVT 2009a), the Joint Model for MVC, also known as JMVC. The JMVC is provided
in order to prove the concepts behind the MVC standard and to facilitate the
experimentation and integration of new tools into the MVC.
Initially, our implementations were described on top of the JMVC 6.0, the latest
version available by the time this work was started. In the face of limitations related to
the simulation of HD1080p sequences (note that the use of these sequences was
standardized in March 2011 (ISO/IEC 2011), after this work was started), our algorithms
were migrated to the more recent JMVC 8.5 in order to extend our experimental
results. The JMVC software structure and the implemented modifications are
detailed in Appendix A.
Table 6.1 presents a summary of the video encoder settings and parameters
most commonly used for experimentation along this monograph. Note that some
settings may vary depending on the nature of the experiments. These changes, however,
are mentioned along the results discussion. Table 6.2 describes the computational
processing resources used for simulation.
To allow other researchers to easily compare their results against ours and, conse-
quently, make our results more meaningful to the current literature, the benchmark
video sequences used in our experimental section were derived from the common
test conditions recommendations provided by JVT (Su et al. 2006) and ISO/IEC
(2011). Table 6.3 presents the video sequence names along with the number
of views, camera organization, and resolution. The considered video resolutions
are VGA (640 × 480), XGA (1024 × 768), and HD1080p (1920 × 1088, typically
cropped to 1920 × 1080), featuring distinct numbers of cameras, camera spacing, and
organization. Although some sequences have up to 100 cameras, our experiments
are constrained to four or eight views depending on the algorithm under evaluation. Please
consider that the main focus of this monograph is MVC encoding for mobile
devices, which are not expected to feature more than eight cameras. Nevertheless, the
concepts behind the energy reduction algorithms proposed in this monograph are
scalable to an increased number of views for applications such as 3DTV and FTV.
To support readers who are not familiar with these video sequences, Fig. 6.1
provides the spatial, temporal, and disparity indexes (SI, TI, and DI) for each
video sequence referred to in Table 6.3.
Fig. 6.1 Temporal index vs. spatial index (top) and disparity index vs. spatial index (bottom) for the benchmark video sequences
The higher the index, the more complex the
sequence is in that specific dimension. The goal is to better understand why some
sequences perform better than others under certain coding conditions and/or
algorithms. The spatial and temporal indexes were proposed in (ITU-T 1999) and have
been used to classify the benchmark video sequences (Naccari et al. 2011) used to
test the next video coding standard, the High Efficiency Video Coding (HEVC/H.265).
Equations (6.1) and (6.2) define SI and TI extended for multiview videos,
where ρ(i,j) represents the pixel luminance value at coordinates i and
j, Sobel denotes the Sobel filter operator, and n is the frame temporal index.
Additionally, in order to further adapt to multiview special needs, we define the
disparity index (DI) based on the same metric used for TI according to Eq. (6.3)
where v is the view index:
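(The equations below are reconstructed in the spirit of the ITU-T P.910 definitions, with std denoting the standard deviation over the spatial coordinates; they are given here as a plausible form of (6.1)-(6.3) rather than the verbatim original formulation.)

\mathrm{SI}_v = \max_{n}\left\{ \operatorname{std}_{i,j}\left[ \mathrm{Sobel}\big(\rho_{n,v}(i,j)\big) \right] \right\} \quad (6.1)

\mathrm{TI}_v = \max_{n}\left\{ \operatorname{std}_{i,j}\left[ \rho_{n,v}(i,j) - \rho_{n-1,v}(i,j) \right] \right\} \quad (6.2)

\mathrm{DI}_v = \max_{n}\left\{ \operatorname{std}_{i,j}\left[ \rho_{n,v}(i,j) - \rho_{n,v-1}(i,j) \right] \right\} \quad (6.3)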
Although the experimental results were generated using standard benchmark video
sequences and standard coding settings, it is frequently not possible to directly
compare our algorithms with the results provided by the published related works. For
this reason, all state-of-the-art competitors were implemented using our infrastructure,
based on the information available in the referred literature. This approach
requires significant implementation effort; however, it ensures that all
algorithms are tested under the same conditions and guarantees the fairness of
comparison between all proposed solutions. The simulation infrastructure and
modifications applied to the JMVC are presented in Appendix A.
The architectural contribution proposed along this monograph includes complete RTL
(Register Transfer Level) description, functional verification, and logical and physical
synthesis. The hardware was described using VHDL hardware description language
followed by functional verification with Mentor Graphics ModelSim (Mentor
Graphics 2012) using real video test vectors. The standard-cell ASIC synthesis for
65-nm technologies was performed using the Cadence ASIC Tool chain (Cadence
Design Inc. 2012). Two distinct processes and standard-cell libraries were considered
in our hardware results, the IBM 65 nm LPe LowK (Synopsys Inc. 2012) and ST 65 nm
Low Power (Circuits Multi-Projects 2012). For preliminary results, FPGA synthesis
targeting Xilinx FPGAs was performed using the Xilinx ISE tool (Xilinx Inc. 2012).
As mentioned above, the presented architecture was completely designed, integrated,
and tested. The only exception is the on-chip SRAM memories featuring
power gating. Since our memory libraries and memory compiler were not able
to generate such memories, regular SRAM memories were instantiated instead for
connectivity and area approximation. The SRAM energy numbers were extracted
from related works that describe, implement, and characterize multiple-power-state
SRAM memories for 65 nm (Fukano et al. 2008; Zhang et al. 2005;
Singh et al. 2007). With these energy numbers and the ME/DE memory traces, a memory
simulator was designed to provide the energy saving results.
This section presents the comparison between the energy-efficient mode decision
algorithms proposed in this monograph and the state of the art in fast mode
decision. The efficiency of the algorithms is measured in terms of time savings
compared to the JMVC using RDO-MD. Also, the video quality (PSNR in dB) and
bitrate (BR in number of bits) variations are presented using RD curves and the Bjøntegaard
rate-distortion metric (Tan et al. 2005).
6.2.1.1 Comparing Our Mode Decision Algorithms to the State of the Art
Figure 6.2 presents the percentage time savings compared to RDO-MD for the early
SKIP mode decision (section "Early SKIP Prediction" in Chap. 4), the two strengths
of our multilevel fast mode decision (Relax and Aggressive, Sect. 4.3.1), and two
related works (Han and Lee 2008; Shen et al. 2009b). Each bar represents the average
over all QPs for that specific video sequence and mode decision algorithm. Even
our simplest solution, the early SKIP algorithm, is able to outperform (Shen et al.
2009b) in most of the cases. The work proposed in (Han and Lee 2008) provides
time savings superior to the early SKIP but pays a price in terms of video quality, as
will be discussed below. The multilevel fast mode decision shows a superior
performance compared to all competitors and provides up to 79 % time reduction.
Additionally, it provides two complexity reduction strengths that allow handling the
energy saving vs. quality trade-off according to the system state and video content.
Fig. 6.2 Average time savings [%] per video sequence for Han, Shen, Early SKIP, Relax, and Aggressive
Fig. 6.3 Time savings [%] for one VGA and one HD1080p sequence across the QP range
Fig. 6.4 Distribution (minimum, quartiles) of the time savings [%] of Han, Shen, Early SKIP, Relax, and Aggressive over all sequences and QPs
The multilevel mode decision outperforms the state of the art for all cases while
keeping the video quality losses within an acceptable range, as discussed below.
The graph in Fig. 6.3 details the time savings behavior for multiple QPs
considering two video sequences, one VGA and one HD1080p.
This plot shows that our fast MD algorithms are able to sustain the time savings over
the whole QP range due to the QP-based thresholding employed. In contrast, the
work of Shen et al. (2009b) employs fixed thresholds and suffers from reduced time
savings, especially for low QPs. At high QPs, the fixed thresholds tend to incur
increased quality drop. To summarize the complexity reduction results, Fig. 6.4
depicts the distribution of time savings provided by each competitor algorithm
considering all video sequences and QPs tested. In summary, the algorithms proposed
along this monograph provide a higher average complexity reduction while
sustaining significant complexity reduction in every encoding scenario. While the work of
Shen et al. (2009b) shows scenarios with only 10 % reduction, the early SKIP prediction
provides at least 38 %. The multilevel fast mode decision ensures time savings
between 55 % and 90 %.
Besides providing complexity reduction, fast mode decision algorithms must
avoid significant video quality losses. In Fig. 6.5 the rate-distortion curves show that
for most of the tested video sequences there is only a small displacement compared to the
RDO-MD solution.
Fig. 6.5 Rate-distortion curves (PSNR [dB] vs. bitrate [Kbps]) for the Ballroom, Exit, Vassar, Race1, Uli, and Breakdancers sequences, among others
The Relax level of our multilevel mode decision scheme
provides RD results very close to the exhaustive RDO-MD for most of the cases.
The Aggressive level incurs slightly worse RD results, especially for the Race1 and
Rena sequences. Please note that our scheme with both Relax and Aggressive levels
provides much higher complexity reduction compared to all other schemes, as discussed
earlier. The usage of the Aggressive level is recommended if high complexity reduction
is desired (e.g., when the battery level of a mobile device is low). Under normal
execution conditions, the Relax level is recommended as it provides superior
complexity reduction compared to the state of the art while keeping the RD performance
close to the RDO-MD. Table 6.4 summarizes the rate-distortion performance
of the discussed mode decision algorithms. On average, the Early SKIP and Relax
solutions present the best RD results. The Aggressive variant of the multilevel fast
mode decision sacrifices RD performance, compared to the other competitors, in order
to provide the highest complexity reduction.
[Figure: video quality [dB] and time savings [%] along the frames of View 0 and View 1]
6.2.1.2 Comparing the Energy-Aware Complexity Adaptation to the State of the Art
[Figure: time savings [%] of the energy-aware complexity adaptation]
6.2.1.3 Comparing the Fast Motion and Disparity Estimation to the State of the Art
The comparison of our Fast ME/DE with the TZ Search algorithm and with state-of-the-art
complexity reduction schemes for ME/DE is presented in this section. Figure 6.7
shows the time savings of our Fast ME/DE algorithm for multiple video sequences
and QPs for 4-view sequences. The TZ Search is used for comparison as it is
23× faster than the Full Search while providing similar rate-distortion
(RD) results. Compared to the TZ Search, our fast ME/DE provides 83 % execution
time savings. In the best case, the execution time savings go up to 86 %, which
represents a significant computation reduction. These results are possible through a
drastic reduction in the number of SAD operations required, as shown in Fig. 6.8.
Fig. 6.8 Number of SAD operations (log scale) for Full Search, TZ, Lin, Tsung, and the Fast ME/DE
Fig. 6.9 Rate-distortion curves (PSNR [dB] vs. bitrate [Mbps]) for Full Search, TZ, Lin, Tsung, and the Fast ME/DE
Compared to (Lin et al. 2008) and (Tsung et al. 2009), the number of SAD operations
is reduced by 99 % and 94 %, respectively. This also represents an 86 % complexity
reduction compared to the original TZ Search.
The Fast ME/DE algorithm was designed to avoid high quality drops and bitrate
increases surpassing 10 %; for this reason, it does not result in high rate-distortion
losses. The RD curves presented in Fig. 6.9 reflect the average 0.116 dB quality
reduction and 10.6 % bitrate increase (see the detailed table in Sect. 4.4.2) resulting
from the aggressive complexity reduction provided by the proposed algorithm.
Also, the Fast ME/DE demonstrates its robustness in terms of complexity reduction
for all tested video resolutions and QPs. This characteristic is desirable for the design
of real-time hardware architectures.
To deal with the quality losses posed by our fast algorithms, we propose, in Sect.
4.5, a complete rate control (RC) solution in order to efficiently manage the video
quality vs. energy trade-off. An efficient RC is supposed to sustain the bitrate as
close as possible to the target bitrate (optimizing the bandwidth usage) while avoid-
ing sudden bitrate variations. To measure the RC accuracy, that is, how close the
actual generated bitrate (Ra) is in relation to the target bitrate (Rt), we use the Mean
Bit Estimation Error (MBEE) metric [see (4.2)]. The average is calculated over all
Basic Units (NBU) along 8 views and 13 GOPs for each video sequence.
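The exact expression is given in (4.2); a common form consistent with the description above (an assumption here, with Ra(i) and Rt(i) denoting the actual and target bitrates of basic unit i) is:

\mathrm{MBEE} = \frac{100\,\%}{N_{BU}} \sum_{i=1}^{N_{BU}} \frac{\lvert R_a(i) - R_t(i) \rvert}{R_t(i)}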
Figure 6.10 presents the accuracy in terms of MBEE (less is better) for our HRC
compared to the state-of-the-art solutions (Li et al. 2003; Yan et al. 2009a; Lee and
Lai 2011) and to our frame-level RC. On average, our Hierarchical Rate Control
provides 0.95 % MBEE, ranging from 0.7 % to 1.37 %. The competitors (Li et al.
2003; Yan et al. 2009a; Lee and Lai 2011) and the frame-level RC present, on average,
2.55 %, 1.78 %, 2.03 %, and 1.18 %, respectively. The HRC reduces the state-of-the-art
error by 0.83 %, on average. The superior accuracy is a result of the ability
to adapt the QP jointly at frame and BU levels while considering the 3D-Neighborhood
correlation and the video content properties.
In Fig. 6.11 the long-term behavior of the distinct rate control schemes is presented
in terms of accumulated bitrate. A more accurate RC maximizes the use of the available
bandwidth and, consequently, its accumulated bitrate tends to stay closer to the
target bitrate line. After a few initial GOPs required for control stabilization, our
HRC curve fits the target bitrate best, followed by our frame-level RC, as shown
in Fig. 6.11. JMVC without RC presents the worst bandwidth usage, as expected.
Once the accuracy of our HRC is demonstrated, we present the rate-distortion (RD)
results to show that the overall video quality and quality smoothness are not
compromised. Table 6.5 summarizes the objective rate-distortion results in terms of BD-PSNR
(Bjøntegaard Delta PSNR) and BD-BR (Bjøntegaard Delta Bitrate) in relation to
JMVC without RC. The HRC provides a 1.86 dB BD-PSNR increase along with a
BD-BR reduction of 40.05 %, on average. Compared to the work of Lee and Lai
(2011), which presents the best RD performance among the related works, the HRC
delivers 0.06 dB increased BD-PSNR and 3.18 % reduced BD-BR. Remember,
besides the superior RD performance, the HRC also outperforms (Lee and Lai 2011) in
terms of accuracy (1.08 % MBEE).
Fig. 6.11 Accumulated bitrate (bits × 10^6) per GOP for the target bitrate, JMVC 8.5, Li, Yan, Lee, the frame-level RC, and the HRC
Figure 6.12 shows the RD curves for different video sequences considering
videos with distinct spatial, temporal, and disparity indexes. The HRC shows its
superiority in relation to the state of the art for most of the RD curves. It is also
important to highlight that the HRC does not insert visual artifacts such as blurring and
blocking noise. Moreover, our RC does not compromise the border sharpness
typically lost in case of bad QP selection.
Results and comparisons with the state of the art for the proposed energy-efficient
hardware architecture are presented in this section. Table 6.6 summarizes the hardware
implementation results with details on gate count and size of on-chip memory.
Fig. 6.12 Rate-distortion curves (PSNR [dB] vs. bitrate [Kbps]) for different sequences, comparing JMVC, Li, Yan, Lee, the frame-level RC, and our HRC
Fig. 6.13 Multibank on-chip memory organization (Memory Banks 0–14) with the APM
Fig. 6.14 Memory energy savings [%] for the Ballroom, Exit, Vassar, and Flamenco sequences
At first analysis, the main drawback of the proposed ME/DE architecture lies in
the increased on-chip video memory in comparison to the state of the art. The on-chip
memory in our hardware is relatively larger as it supports a much bigger search
window of up to [±96, ±96], compared to [±16, ±16] in Tsung et al. (2009) (which is
insufficient to capture larger disparity vectors). However, the larger on-chip memory
does not imply an increased power dissipation because of the dynamic power
management and power-gating techniques employed in our solutions. Compared to Zatt
et al. (2011c), an on-chip memory reduction of about 30 % is achieved.
The authors of Tsung et al. (2009) use a rectangular data reuse technique such as
Level-C (Chen et al. 2006), which performs inefficiently compared to our proposed
solution (the dynamic search window formation). Note, Level-C (Chen et al. 2006)
with a search window of [±96, ±96] would require 4 memories of 288 Kbit (i.e., a
total of 1,152 Kbit) to exploit the reusability in the four possible prediction directions
available in MVC. Our approach implements it with 512 Kbit. To perform a fair
comparison, we have deployed the Level-C and Level-C+ (Chen et al. 2006)
techniques in our hardware architecture.
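For reference, a sketch of the arithmetic behind these figures, assuming 8-bit luminance samples and the document's convention that a [±96, ±96] window spans 192 × 192 reference pixels:

192 \times 192 \,\text{pixels} \times 8\,\text{bit} = 294{,}912\,\text{bit} = 288\,\text{Kbit}, \qquad 4 \times 288\,\text{Kbit} = 1{,}152\,\text{Kbit}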
Figure 6.14 shows the energy benefit of employing our dynamically expanding
search window and the multibank on-chip memory with power gating. Compared to
the Level-C and Level-C+ (Chen et al. 2006) prefetching techniques (based on
rectangular search windows), our approach reduces the energy of the on-chip and
off-chip memories, as shown in Fig. 6.14. For a search window of [±96, ±96], our
approach provides an energy reduction of 82–96 % and 57–75 % for off-chip
and on-chip memory access, respectively. These significant energy savings are due
to the fact that Level-C and Level-C+ (Chen et al. 2006) suffer from a high data
retransmission for the first MB of every row. Additionally, our approach provides
higher data reuse and incurs reduced leakage due to the smaller on-chip memory and the
power gating of unused sectors.
6.3 Summary of Results and Comparison
To cope with comparison fairness issues, this chapter detailed all settings
and videos used for comparison along this monograph. The video benchmark
sequences were classified using the spatial, temporal, and disparity indexes.
Chapter 7
Conclusion and Future Works
The presented monograph focuses on the energy reduction of the Multiview Video
Coding (MVC) encoder to enable the realization of real-time high-definition
3D-video encoding on mobile embedded devices with battery-constrained
energy. For that, novel energy-efficient techniques are proposed at both the algorithmic
and architectural abstraction levels. The joint consideration of algorithms and the
underlying hardware architecture is the key enabler to provide improved energy
efficiency, as demonstrated along this monograph.
The strong correlation within the 3D-Neighborhood domain, a concept defined in
this work, has been the basis for designing most of the proposed algorithms and
hardware architecture adaptation schemes. An extensive study based on statistical
analysis correlating MVC coding side information (such as coding modes, motion/
disparity vectors, and RDCost) to the video content properties is provided to justify
the importance of understanding the 3D-Neighborhood and to demonstrate its
potential to support energy reduction in the MVC encoder.
A set of energy-efficient algorithms for MVC composes one of the major contributions
of this work to the state of the art. A multilevel fast mode decision
algorithm with six levels is described, targeting energy efficiency through complexity
reduction. The Early SKIP prediction, one stage of our scheme, exploits the high
occurrence of SKIP-coded MBs to accelerate the encoding process by employing
statistical methods that determine whether each MB lies in the high SKIP probability
region, in order to avoid the evaluation of other coding modes. Our algorithm eliminates
coding mode evaluations even in the case where an early SKIP is not detected, by analyzing
the coding modes available within the 3D-Neighborhood while considering a video/
RDCost-based mode ranking. The video properties are also used to define block
sizes and prediction mode orientations. To protect the multilevel fast MD algorithm
from inserting excessive quality losses, an early termination test is inserted between
prediction steps. This algorithm defines QP-based thresholds for two distinct
energy reduction strengths: the Relax and Aggressive strengths. By employing two
operation modes it is possible to select the best energy vs. quality trade-off for a
given system state and video content. Moreover, the multiple fast MD modes enable the
energy-aware complexity adaptation.
Beyond the contributions brought by this monograph, there are multiple
research topics related to 3D-video coding and video processing that were not
addressed in this volume. The algorithms and architectures presented here were
centered on mode decision and motion and disparity estimation, since these are the
most demanding blocks in terms of complexity and energy consumption.
Although the main challenges in terms of complexity and energy consumption are
related to the MD and ME/DE blocks, attending to the MVC demands while respecting
energy constraints also poses challenges related to other MVC processing blocks.
The entropy encoder, for instance, may become the bottleneck of the encoder system
if no proper parallelization is employed. The block-level data dependencies in
intra prediction also require research attention. Finding efficient solutions to deal
with data dependencies and parallelization issues provides interesting research
opportunities for future works.
Video encoding is one single stage of the 3D-video system. Between the video capturing
and video coding phases, there is a need for preprocessing such as geometrical
calibration (for correcting the alignment of the multiple videos) and color correction
(responsible for equalizing the brightness level and color gamut). After transmission
and decoding, the video is processed for displaying depending on the application
and display technology. This post-processing phase includes color space
mapping (in systems using color polarization), resolution scaling, and viewpoint
synthesis (generation of intermediate viewpoints for displaying). The pre- and
post-processing stages implement complex and data-intensive algorithms (especially for
viewpoint synthesis) that run concurrently with the video encoder/decoder and require
real-time performance. Therefore, the embedded energy and hardware resources
must be shared to meet both the video coding and the pre/post-processing demands.
The next generation of 3D-video coding is currently referred to as 3DV (3D-Video)
(ISO/IEC 2009) and is based on the Video + Depth concept, which defines distinct
channels to transmit the video and the depth maps. The 3DV is expected to be defined
as an extension to the HEVC/H.265 (Sullivan and Ohm 2010). The 3DV tools will
bring a completely new set of challenges, boosting the research topics related to
3D-multimedia. Moreover, the lifetime of video coding standards is expected to shrink
for future standard generations, resulting in the simultaneous coexistence of multiple
coding standards. Thus, there is a need to support multiple complex coding standards
in the same device by employing flexible and adaptive solutions.
Appendix A
JMVC Simulation Environment
The JVT (Joint Video Team), formed from the cooperation between the ITU-T
Study Group 16 (VCEG) and the ISO/IEC Motion Picture Experts Group (MPEG)
and responsible for the standardization of H.264, SVC (scalable video coding), and
MVC, provides software models used for algorithm experimentation and for the
standards' proof of concept. The JMVC (Joint Model for MVC), currently at version
8.5, is the reference software available for experimentation on the MVC standard.
Along this work the JMVC software, coded in C++, was used and modified to
implement the proposed algorithms. Initially, version 6.0 was used, followed by
an upgrade to version 8.5. Considering the length and complexity of the software,
a high-level overview of the interaction between the main encoder classes is presented
here. Afterwards, the classes modified to enable our algorithm experimentation
are shown. For in-depth details of the class structure, refer to the JMVC
documentation.
The JMVC classes are hierarchically structured as shown in Fig. A.1. The JMVC
encodes one view at a time, requiring as many encoder calls as there are views to be
encoded. The reference views are stored in temporary files. The class
H264AVCEncoderTest represents the top encoder entity; it initializes the encoder
and calls the CreateH264AVCEncoder class to initialize the other coding classes. At
this level, the PicEncoder is initialized and the frame-level loop is controlled. The
PicEncoder loops over the slices inside each frame and resets the RDCost. The slice
encoder controls the MB-level loop and sets the reference frames for each slice.
For MB encoding there are two main classes: the MbEncoder and the MbCoder.
The MbEncoder encapsulates all the prediction, transform, and entropy steps. It
implements the mode decision by looping over and encoding all possible coding modes
(in the case of RDO-MD) and determining the minimum RDCost.
Fig. A.1 JMVC encoder class structure: H264AVCEncoderTest (H264AVCEncoderTest.cpp) initializes the encoder, sets up the frame buffers, and loops over frames; CreateH264AVCEncoder (CreateH264AVCEncoder.cpp) instantiates the remaining coding classes; MbEncoder (MbEncoder.cpp) loops over the coding modes, performs the encoding for each mode, and selects the best RDCost; MbCoder (MbCoder.cpp) encodes the best mode and creates the bitstream (BS)
At this point no MB
coding data is written to the bitstream. Once the best mode is selected, the
SliceEncoder calls the MbCoder to write the MB-level side information and resi-
dues to the bitstream.
Figure A.2 (Tech et al. 2010) depicts the hierarchical call graph of the methods
inside the mode decision process implemented in the MbEncoder class. Firstly, the
SKIP and Direct modes are evaluated; along this monograph these modes are jointly
referred to as SKIP MBs. In the following, all inter-prediction block sizes are
evaluated. For each partition size a call to the method MotionEstimation::estimateBlock
WithStart (see the Fig. A.3 discussion) is performed. The same happens for the
sub-partitions in the case of 8 × 8 partitioning. EstimateMb8x8Frext is only called in case
the FRExt flag is set. Finally, the intra-frame coding modes including PCM,
intra4 × 4, intra8 × 8 (FRExt only), and intra16 × 16 are called. The Estimate<mode>
methods call the complete coding loop for that specific mode, including prediction,
transforms, quantization, entropy encoding, and reconstruction. This allows a precise
definition of the minimum RDCost (λ) and an optimal best mode selection at the
cost of elevated coding complexity. The MbCoder is called to entropy encode the
best mode and write the data into the bitstream output buffer.
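The loop can be summarized by the following generic sketch (hypothetical Mode and encodeAndCost interfaces, not JMVC's actual signatures):

#include <limits>
#include <vector>

struct Mode { int id; };

double encodeAndCost(const Mode& m);  // assumed: full predict/transform/quantize/entropy loop

// Exhaustive RDO: every candidate mode is fully encoded and the one with the
// minimum RDCost wins; only the winner is later written to the bitstream.
Mode rdoModeDecision(const std::vector<Mode>& candidates) {
    Mode best = candidates.front();               // assumes a non-empty candidate list
    double bestCost = std::numeric_limits<double>::max();
    for (const Mode& m : candidates) {            // SKIP/Direct, inter partitions, intra, PCM
        const double cost = encodeAndCost(m);
        if (cost < bestCost) { bestCost = cost; best = m; }
    }
    return best;
}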
The motion and disparity estimation search itself is defined in the method
estimateBlockWithStart and is composed of three basic steps. The ME/DE dataflow is
represented by the arrows in Fig. A.3.
Fig. A.2 Hierarchical call graph of the mode decision in MbEncoder: EstimateMbSkip and EstimateMbDirect (inter P/B), EstimateMb16x16 (1×), EstimateMb16x8 (2×), EstimateMb8x16 (2×), and EstimateMb8x8 (4×, with sub-partitions EstimateSubMb8x8, EstimateSubMb8x4, EstimateSubMb4x8, EstimateSubMb4x4, and EstimateSubMbDirect), plus EstimateMb8x8Frext and EstimateMbPCM (intra); each inter partition calls estimateBlockWithStart, and the minimum RDCost defines the best mode
Fig. A.3 InterSearch dataflow: Search List 0, Search List 1, and the interactive B prediction (List 0 and List 1) each perform an ME/DE Full Pel search followed by ME/DE Sub Pel refinement, minimizing the MotionCost
Once estimateBlockWithStart is called, for instance from EstimateMb16 × 16, the search runs for each reference frame list
(List 0 and List 1) and for an interactive B search mode that exploits both lists in an
interactive fashion (in case the interactive B search is active). From the software perspective,
there is no distinction between ME and DE: List 0 and List 1 store both temporal
and disparity reference frames. The search for a given reference frame first finds
the best candidate block among the integer pixels (ME/DE Full Pel) and then refines
the result considering half and quarter pixels (ME/DE Sub Pel). The search pattern
depends on the search algorithm; JMVC implements TZ Search, Full Search,
Spiral Search, and Log Search. The goal is to find the candidate block that minimizes
the Motion Cost (λMotion) in terms of SAD, SAD-YUV (considering chroma
channels), SATD, or SSE, according to user-defined coding parameters. The position
of the best matching candidate block defines the motion or disparity
vector.
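The minimized cost can be sketched as distortion plus λ-weighted vector rate (illustrative hooks only; the distortion term is SAD, SAD-YUV, SATD, or SSE depending on the configuration):

#include <cstdint>

uint32_t distortion(int mvx, int mvy);                    // assumed: candidate distortion
uint32_t mvBits(int mvx, int mvy, int predx, int predy);  // assumed: vector coding bits

// MotionCost = D + lambdaMotion * R(mv - pred); the candidate minimizing this
// cost defines the motion or disparity vector.
inline uint64_t motionCost(int mvx, int mvy, int predx, int predy, uint32_t lambdaMotion) {
    return distortion(mvx, mvy) +
           static_cast<uint64_t>(lambdaMotion) * mvBits(mvx, mvy, predx, predy);
}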
In order to generate the statistics used for the coding modes and motion/disparity
vectors, some modifications were made to the original JMVC code. The point selected
for this tracing is inside the entropy encoder, to guarantee that the extracted data is
the same actually encoded and transmitted. The entropy encoder is declared as the
virtual class MbSymbolWriteIf, but the actual implementation is in CabacWriter
or UvlcWriter, depending on the entropy encoder selected in the configuration file.
The monitored methods are skipFlag, which encodes the SKIP (and Direct)-coded
MBs, and mbMode, which encodes all other modes. Note, the MB coding mode codes
(uiMbMode) vary with the slice type as defined in Tables 7-11, 7-12, 7-13, and
7-14 of the MVC standard (JVT 2008).
Multiple algorithms proposed in this monograph employ information from the
3D-Neighborhood. For that, there is a need to build communication channels
between neighboring MBs in the spatial, temporal, and disparity domains; in other
words, an infrastructure to send and receive data at the MB level, at the frame level (in the same
view), and at the view level (frames in different views). Therefore, a hierarchical
communication infrastructure was designed and implemented. Figure A.4 graphically
presents the modified classes along with the new member data structures and
communication methods.
The MbDataAccess already provides direct access to the left and upper neigh-
bors (A, B, C, and D in Fig. A.4). This access was extended to the right and bottom
neighbors (A*, B*, C*, and D*), enabling access to data from all spatial neighboring MBs.
Fig. A.4 Hierarchical communication infrastructure: MB-level data is exchanged through MbDataAccess::Send(MB,x,y), MbDataAccess::Receive(MB,x,y), and MbDataAccess::Write(MB,x,y); frame-level data through SliceHeader::Send(slice,POC) and SliceHeader::Receive(slice,POC); and view-level data through PicEncoder::Read(file,view)
The mode decision is programmed in a very simple way, making it easy to find and
modify. The MD is handled in MbEncoder::encodeMacroblock. To find the exact point,
search for the xEstimateMb methods responsible for calling the mode evaluations.
Before this point, the 3D-Neighborhood communication calls and the calculations
required to take the fast decisions are implemented.
The modifications for the fast ME/DE are inserted in two distinct classes. For
modifications at a higher level, such as avoiding the interactive B search or changing the
search direction and reference frames, the modifications are done in the MbEncoder by
modifying the xEstimateMb methods. If the modifications concern the search step itself,
the MotionEstimation class is the right point for modification. The estimateBlockWithStart
method is responsible for fetching the image data, predicting the SKIP vectors, and
calling the search methods (xPelBlockSearch, xPelSpiralSearch, xPelLogSearch,
and xTZSearch). By modifying these methods it is possible to apply low-level
modifications to the ME/DE search.
The JMVC does not implement any rate control algorithm. Therefore, to implement
the hierarchical rate control (HRC) scheme, one new class was created, the RateControl.
Three files are used to better partition the RC hierarchy. The file RateCtlCore.cpp
describes the behavior of the whole HRC, while RateCtlMPC.cpp and RateCtlUB.cpp
are responsible for the calculations relative to the MPC and MDP controllers.
The RateCtl.h file is used to define the MPC and MDP actuation parameters. The QP
history is read from the CodingParameter class and the generated bitrate is accessed via
BitWriteBuffer and BitCounter. The QPs defined for the next frames or BUs are sent
back to CodingParameter. Additional modifications were required in the files MbCoder.cpp,
CodingParameter.cpp, RateDistortion.cpp, ControlMngH264AVCEncoder.cpp,
Multiview.cpp, and ControlMngH264AVCEncoder.h.
Appendix B
Memory Access Analyzer Tool
The MVC Viewer software is used as part of this work to plot and analyze the
memory accesses required by the motion and disparity estimation (ME/DE).
The goal of this tool is to help researchers with the visual and
statistical analysis of the communication between the multiview video encoder and
the reference samples memory. It provides a set of final statistics and several plots
using the original input video.
The MVC Viewer was designed to be adapted to different encoder parameters. In
a configuration file, the user specifies (a) the number of views, (b) the GOP
size, (c) the video resolution, (d) the original YUV video file paths, and, finally, (e)
the memory tracing input file paths. The tracing file is an intermediate way to
connect the output of a video encoder, like JM or x264, with the MVC Viewer tool. In
this file, all memory accesses performed by the ME/DE are listed.
This tool runs over the JVM (Java Virtual Machine) and provides a simple
interface for the analysis. Figure B.1 presents an overview of the MVC Viewer main
screen. The main parts are as follows:
1. Encoding parameters: GOP size, number of coded frames, number of coded
views, and video resolution (directly defined in the configuration files)
2. Tracing file path, where all accessed regions of the reference frames are listed
3. Original YUV videos
4. Program mode selection: the MVC Viewer has mainly two possible analysis
tools:
(a) Current macroblock-based analysis
(b) Reference frame-based analysis
5. Listbox with all memory accesses that will be plotted in the output
The two analyses that are allowed by the MVC Viewer tool will be explained in
the next sections.
In this analysis, the goal is to trace all accessed reference frame samples when the
ME/DE is performed for one or more current macroblocks. The MVC massively
uses multiple reference frames, so the MVC Viewer generates several
plots that determine the accessed regions of each reference frame (temporal
and disparity neighbors). Figure B.2 shows an MVC Viewer screenshot while
running this analysis. The main parts are as follows:
1. Selection of the target macroblocks to be traced
2. List of all selected macroblocks
3. List of all memory accesses caused by the ME/DE for the selected macroblocks
Figure B.3 presents an output example for one macroblock that results in sample
accesses in the four directions: past and future temporal reference frames, and right
and left disparity reference frames.
This analysis selects one specific frame and traces all accesses performed by the
ME/DE when the selected frame is used as reference. This way, it is possible to
determine the most accessed regions of the frame. Knowledge about this behavior
is important to define strategies to save memory bandwidth. Figure B.4 presents
the MVC Viewer during this analysis, where the main parts are as follows:
1. Reference frame selection: the user must define the frame identification (the
view and frame positions) to be traced.
Fig. B.3 Output example: four prediction directions and their respective accessed areas
2. Current MB tracing option: the user has the possibility to delimit an area inside
the reference frame to discover which current blocks processed by the ME/DE
cause the accesses.
3. List of all memory accesses caused by the ME/DE in the selected reference
frame.
Fig. B.5 Output example: reference frame access index considering two block matching algo-
rithms: full search and TZ search
Figure B.5 presents two different examples of the reference frame-based analysis
considering two search algorithms: (a) Full Search and (b) TZ Search.
The Full Search has a regular access pattern where all samples inside the search
window are fetched. On the other hand, the TZ Search has a heuristic behavior and
its access index varies in accordance with the video properties (low/high motion/
disparity). These two different cases are represented in the plots of Fig. B.5.
Appendix C
CES Video Analyzer Tool
The CES Video Analyzer tool was developed in-house targeting the display and
analysis of video properties. It was written in the C# programming language and
features the graphical user interface presented in Fig. C.1. The goal of the original
tool is to support the decision making during the design of novel coding algorithms. The
tool supports diverse display modes, including a luminance-only mode and MB grid
overlays. Also, the CES Video Analyzer implements image filters such as the
Sobel, Laplace, Kirsch, and Prewitt filters, besides luminance, gradient, and variance
maps. An additional information window summarizes all image properties.
Figure C.2 exemplifies the tool features, presenting the original frame with the MB
grid, the Sobel-filtered image, and the variance map.
To facilitate the development of the algorithms proposed in this volume, the CES
Video Analyzer was extended to support and provide better visualization for MVC
videos. Figure C.3 shows the visualization of a frame differentiating SKIP, inter,
and intra MBs. In Fig. C.4 all MBs, including SKIPs, are classified into disparity
estimation or motion estimation for different time instants.
Abbo AA et al (2008) Xetal-II: a 107 GOPS, 600 mW massively parallel processor for video scene
analysis. IEEE J Solid-State Circuits 43:192–201
Agarwal K et al (2006) Power gating with multiple sleep modes. In: International symposium on
quality electronic design, 27-29 March 2006, San Jose, CA, pp 633–637
Agrafiotis D et al (2006) Multiple priority region of interest coding with H.264. In: IEEE interna-
tional conference on image processing, ICIP, October 2006, Atlanta, pp 53–56
Akin A, Sayilar G, Hamzaoglu I (2010) A reconfigurable hardware for one bit transform based
multiple reference frame motion estimation. In: Design, automation and test in Europe (DATE),
EDAA, Milan, Italy, pp 393–398
Arapostathis A, Kumar R, Tangirala S (2003) Controlled Markov chains with safety upper bound.
IEEE Trans Automat Contr 48:1230–1234
ARM Ltd. (2012) ARM—the architecture for the digital world. http://www.arm.com/
Arsura E et al (2005) Fast macroblock intra and inter modes selection for H.264/AVC. In:
International conference on multimedia and expo (ICME), July 6-8, 2005, Amsterdam, The
Netherlands, pp 378–381
Barto AG (1994) Reinforcement learning control. Curr Opin Neurobiol 4:888–893
Bauer L et al (2007) RISPP: rotating instruction set processing platform. In: 44th Design automa-
tion conference, San Diego, CA, pp 791–796 [s.n.]
Bauer L et al (2008a) Run-time system for an extensible embedded processor with dynamic
instruction set. In: Design, automation and test in Europe, Munich, Germany, pp 752–757 [s.n.]
Bauer L, Shafique M, Henkel J (2008b) Run-time instruction set selection in a transmutable
embedded processor. In: 45th Design automation conference, pp 56–61
Beck ACS et al (2008) Transparent reconfigurable acceleration for heterogeneous
embedded applications. In: Design automation conference, 8-13 June, Anaheim, CA,
pp 1208–1213
Bennett K (2011) Intel Core i7-3960X (Sandy Bridge E) processor review, HardOCP,
November 2011. http://hardocp.com/article/2011/11/14/intel_core_i73960x_sandy_bridge_e_processor_review/4
Berekovic M et al (2008) Mapping of nomadic multimedia applications on the ADRES reconfigu-
rable array processor. Microprocess Microsyst 33:290–294
Bhaskaran V, Konstantinides K (1999) Image and video compression standards: algorithms and
architectures. Kluwer Academic, Boston, MA
Blanche P-A et al (2010) Holographic three-dimensional telepresence using large-area photore-
fractive polymer. Nature 468:80–83
Blu-ray Disc Association (2010) White paper blu-ray disc read-only format. http://www.blu-raydisc.com/assets/Downloadablefile/BD-ROM_Audio_Visual_Application_Format_Specifications-18780.pdf
Cadence Design Systems, Inc. (2012) Digital implementation. http://www.cadence.com/products/
di/Pages/default.aspx
Cao Z et al (2010) Optimality and improvement of dynamic voltage scaling algorithms for multi-
media applications. IEEE Trans Circuits Syst I Regul Pap 57:681–690
Chan C-C, Tang C-W (2012) Coding statistics based fast mode decision for multi-view video cod-
ing. J Vis Commun Image Represent. doi: 10.1016/j.jvcir.2012.01.004
Chang H-C et al (2009) A dynamic quality-adjustable H.264 video encoder for power-aware video applications. IEEE Trans Circuits Syst Video Technol 19(12):1739–1754
Chang NY-C et al (2010) Algorithm and architecture of disparity estimation with mini-census
adaptive support weight. IEEE Trans Circuits Syst Video Technol 20:792–805
Chen JC, Chien S-Y (2008) CRISP: coarse-grained reconfigurable image stream processor for digital still cameras and camcorders. IEEE Trans Circuits Syst Video Technol 18:1223–1236
Chen Z, Zhou P, He Y (2002) Fast integer pel and fractional pel motion estimation for JVT—JVT-F017. Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG 6th meeting, Awaji, Japan
Chen C-Y et al (2006) Level C+ data reuse scheme for motion estimation with corresponding cod-
ing orders. IEEE Trans Circuits Syst Video Technol 16:553–558
Chen T-C et al (2007) Fast algorithm and architecture design of low-power integer motion estima-
tion for H.264/AVC. IEEE Trans Circuits Syst Video Technol 17:568–577
Chen Y et al (2009a) Coding techniques in multiview video coding and joint multiview video
model. In: Picture coding symposium, IEEE, Piscataway, NJ, pp 313–316
Chen Y et al (2009b) The emerging MVC standard for 3D video services. In: 3DTV conference, May 4-6, 2009, Potsdam, Germany, pp 1–13
Chen Y-H et al (2009c) Algorithm and architecture design of power-oriented H.264/AVC baseline profile encoder for portable devices. IEEE Trans Circuits Syst Video Technol 19:1118–1128
Chien S-Y et al (2008) An 8.6 mW 25 Mvertices/s 400-MFLOPS 800-MOPS 8.91 mm² multimedia
stream processor core for mobile applications. IEEE J Solid-State Circuits 43:2025–2035
Chiu J-C, Chou Y-L (2010) Multi-streaming SIMD multimedia computing engine. Microprocess
Microsyst 34:247–258
Chuang T-D et al (2010) A 59.5 mW scalable/multi-view video decoder chip for Quad/3D Full HDTV and video streaming applications. In: IEEE international solid-state circuits conference (ISSCC), 7-11 Feb, San Francisco, CA, pp 330–331
Circuits Multi-Projects (2012) STMicroelectronics deep sub-micron processes. http://cmp.imag.fr/aboutus/slides/slides2007/04_KT_ST.pdf
CISCO (2012) Cisco visual networking index: global mobile data traffic forecast update,
2011–2016. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/
white_paper_c11-520862.pdf
Cong J et al (2009) Automatic memory partitioning and scheduling for throughput and power
optimization. In: International conference on computer aided design (ICCAD), November 2-5,
2009, San Jose, CA, pp 697–704
de-Frutos-López M et al (2010) An improved fast mode decision algorithm for intraprediction in
H.264/AVC video coding. Signal Process Image Commun 25:709–716
Deng Z-P et al (2009) A fast view-temporal prediction algorithm for stereoscopic video coding. In: International congress on image and signal processing (CISP), 17-19 October 2009, Tianjin, China, pp 1–5
Díaz-Honrubia AJ, Martínez JL, Cuenca P (2012) HEVC: a review, trends and challenges. In: Workshop on multimedia data coding and transmission (WMDCT), September 2012, Alicante, Spain
Ding L-F et al (2008a) Content-aware prediction algorithm with inter-view mode decision for
multiview video coding. IEEE Trans Multimedia 10:1553–1564
Ding L-F et al (2008b) Fast motion estimation with inter-view motion vector prediction for stereo
and multiview video coding. In: International conference on acoustics speech and signal pro-
cessing (ICASSP), Las Vegas, NV, March 30 – April 4, 2008, pp 1373–1376
Ding L-F et al (2010a) A 212 MPixels/s 4096×2160p multiview video encoder chip for 3D/Quad full HDTV applications. IEEE J Solid-State Circuits 45:46–58
Dodgson NA (2005) Autostereoscopic 3D displays. IEEE Comput 38:31–36
Dolby (2012) Dolby 3D. http://www.dolby.com/us/en/consumer/technology/movie/dolby-3d.html
Erdayandı K (2009) JMVC documentation. In: JMVC—JVT-AD207. http://students.sabanciuniv.edu/~kerdayandi/jmvc/index_jmvc.html
Finchelstein DF, Sze V, Chandrakasan AP (2009) Multicore processing and efficient on-chip cach-
ing for H.264 and future video decoders. IEEE Trans Circuits Syst Video Technol
19:1704–1713
Fujifilm (2011) FinePix REAL 3D W3 | FujiFilm Global. http://www.fujifilm.com/products/3d/
camera/finepix_real3dw3/
Fujii T (2010) Panel Discussion 1 (D1)—3DTV/FTV. In: Picture coding symposium (PCS),
December 7-10, 2010, Nagoya, Japan
Fukano G et al (2008) A 65nm 1Mb SRAM macro with dynamic voltage scaling in dual power
supply scheme for low power SoCs, In: Joint Non-Volatile Semiconductor Memory Workshop
and International Conference on Memory Technology and Design (NVSMW/ICMTD), Opio,
France, pp 97–98
García CE, Prett DM, Morari M (1989) Model predictive control: theory and practice—a survey. Automatica 25:335–348
Gassée J-L (2010) Intel’s bold bet against ARM: visionary or myopic? Monday Note. http://www.
mondaynote.com/2010/06/27/intel%E2%80%99s-bold-bet-against-arm-visionary-or-myopic/
Ghanbari M (1990) The cross-search algorithm for motion estimation. IEEE Trans Commun
38:950–953
Grecos C, Yang MY (2005) Fast inter mode prediction for P slices in the H.264 video coding standard. IEEE Trans Broadcast 51(2):256–263
Grun P, Balasa F, Dutt N (1998) Memory size estimation for multimedia applications. In: International workshop on hardware/software codesign (CODES/CASHE), Irvine, CA, pp 145–149
Han D-H, Lee Y-L (2008) Fast mode decision using global disparity vector for multiview video coding. In: Second international conference on future generation communication and networking symposia (FGCNS), December 13-15, 2008. IEEE Computer Society, Washington, DC, pp 209–213
He Z, Cheng W, Chen X (2008) Energy minimization of portable video communication devices
based on power-rate-distortion optimization. IEEE Trans Circuits Syst Video Technol
18:596–608
Huang Y-W, Hsieh B-Y, Chien S-Y, Ma S-Y, Chen L-G (2006) Analysis and complexity reduction of multiple reference frames motion estimation in H.264/AVC. IEEE Trans Circuits Syst Video Technol 16(4):507–522. doi: 10.1109/TCSVT.2006.872783
Huang Y-H, Ou T-S, Shen H (2009) Fast H.264 selective intra mode decision for inter-frame coding. In: Picture coding symposium (PCS), May 6-8, 2009, Chicago, Illinois, USA, pp 377–380
Huff HR, Gilmer DC (2004) High dielectric constant materials: VLSI MOSFET applications.
Springer, New York, NY
IC Insights (2012) IC Insights raises forecast for tablets, notebooks, total PC shipments. Electronic
specifier. http://www.electronicspecifier.com/Tech-News/IC-Insights-Raises-Forecast-Tablets-
Notebooks-Total-PC-Shipments.asp
IMAX (2012) IMAX 3D. http://www.imax.com/about/imax-3d/
ISO/IEC (2009) Vision on 3D video. http://mpeg.chiariglione.org/visions/3dv/index.htm
ISO/IEC (2011) Common test conditions for MVC—W12036. ISO/IEC JTC1/SC29/WG11, 96th MPEG meeting, Geneva, Switzerland, March 2011
ITU-T (1999) Subjective video quality assessment methods for multimedia applications—P.910. Series P: telephone transmission quality, telephone installations, local line networks
Javed H et al (2011) Low-power adaptive pipelined MPSoCs for multimedia: an H.264 video encoder case study. In: Design automation conference (DAC), June 2-6, San Diego, CA, pp 1032–1037
Jeon BW, Lee JY (2003) Fast mode decision for H.264—Document JVT-J033. Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG 8th meeting, Waikoloa, HI
Ji W, Li P, Chen M, Chen Y (2009) Power scalable video encoding strategy based on game theory. In: Muneesawang P, Wu F, Kumazawa I, Roeksabutr A, Liao M, Tang X (eds) Advances in multimedia information processing—PCM 2009. Springer, Berlin, Heidelberg, pp 1237–1243
Jiang M, Yi X, Ling N (2004) Improved frame-layer rate control for H.264 using MAD ratio. In:
International symposium on circuits and systems, ISCAS, 23-26 May, Vancouver, Canada
Jing X, Chau L-P (2001) An efficient three-step search algorithm for block motion estimation. IEEE Trans Multimedia 6(3):435–438. doi: 10.1109/TMM.2004.827517
Jing X, Chau L-P (2004) Fast approach for H.264 inter mode decision. Electron Lett 40(17):1050–1052. doi: 10.1049/el:20045243
JVT (2003) Draft ITU-T Rec. and final draft international standard of joint video specification
JVT (2008) Joint draft 8.0 on multiview video coding—JVT-AB204
JVT (2009a) JMVC 6.0 [garcon.ient.rwth-aachen.de]
JVT (2009b) Joint multiview video coding
Kamat SP (2009) Energy management architecture for multimedia applications in battery powered
devices. IEEE Trans Consum Electron 55:763–767
Kauff P et al (2007) Depth map creation and image-based rendering for advanced 3DTV services
providing interoperability and scalability. Signal Process Image Commun 22:217–234
Kay R (2011) Is the PC dead? Forbes. http://www.forbes.com/sites/rogerkay/2011/02/28/is-the-pc-dead/
Khailany BK et al (2008) A programmable 512 GOPS stream processor for signal, image, and
video processing. IEEE J Solid-State Circuits 43:202–213
Kim B-G, Cho C-S (2007) A fast inter-mode decision algorithm based on macro-block tracking for
P slices in the H.264/AVC video standard. In: International conference on image processing
(ICIP), September 16-19, 2007, San Antonio, Texas, USA, pp V-301–V-304
Kim C, Kuo C-CJ (2007) Feature-based intra-/intercoding mode selection for H.264/AVC. IEEE
Trans Circuits Syst Video Technol 17:441–453
Kim D-Y, Lee Y-L (2011) A fast intra prediction mode decision using DCT and quantization for
H.264/AVC. Signal Process Image Commun 26:455–465
Kim Y, Kim J, Sohn K (2007a) Fast disparity and motion estimation for multi-view video coding. IEEE Trans Consum Electron 53:712–719
Kim Y, Kim J, Sohn K (2007b) Fast disparity and motion estimation for multi-view video coding. IEEE Trans Consum Electron 53:712–719
Ko H, Yoo K, Sohn K (2009) Fast mode-decision for H.264/AVC based on inter-frame correlations.
Signal Process Image Commun 24:803–813
Kollig P, Osborne C, Henriksson T (2009) Heterogeneous multi-core platform for consumer multi-
media applications. In: Design, automation test in Europe conference, April 20-24, Nice,
France, pp 1254–1259
Kondo H et al (2009) Heterogeneous multicore SoC with SiP for secure multimedia applications. IEEE J Solid-State Circuits 44:2251–2259
Koo H-S, Jeon Y-J, Jeon B-M (2007) MVC motion skip mode—Doc. JVT-W081
Krolikoski S (2004) Orinoco saves power. ChipVision Design Systems. http://www.eda.org/edps/edp04/submissions/presentationKrolikoski.pdf
Krüger J et al (2005) Image based 3D surveillance for flexible man-robot cooperation. CIRP Ann Manuf Technol 54:19–22
Kuhn P (1999) Algorithms, complexity analysis and VLSI architectures for MPEG-4 motion
estimation. Kluwer Academic, Boston, MA
Kume H (2010) Panasonic's new Li-Ion batteries use Si anode for 30% higher capacity. TechOn. http://techon.nikkeibp.co.jp/article/HONSHI/20100223/180545/
Kwon D-K, Shen M-Y, Kuo C-CJ (2007) Rate control for H.264 video with enhanced rate and distortion models. IEEE Trans Circuits Syst Video Technol 17:517–529
Lee P-J, Lai Y-C (2011) Vision perceptual based rate control algorithm for multi-view video coding. In: International conference on system science and engineering (ICSSE), 8-10 June 2011, Macao, China, pp 342–345
Lee S-Y, Shin K-M, Chung K-D (2008) An object-based mode decision algorithm for multi-view video coding. In: International symposium on multimedia (ISM), December 15-17, 2008, Berkeley, California, USA, pp 74–81
Li ZG et al (2003) Adaptive basic unit layer rate control for JVT—JVT-G012. Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG 7th meeting, Pattaya, Thailand
Liang Y, Ahmad I (2009) Power and distortion optimization for pervasive video coding. IEEE
Trans Circuits Syst Video Technol 19:1436–1447
Lim KP (2003) Fast inter mode selection—Document JVT-I020. In: 9th JVT Meeting
Lin J-P, Tang AC-W (2009) A fast direction predictor of inter frame prediction for multi-view video coding. In: IEEE international symposium on circuits and systems (ISCAS), IEEE, Piscataway, pp 2598–2593
Lin Y-K et al (2008) A 242 mW 10 mm² 1080p H.264/AVC high profile encoder chip. In: Design automation conference (DAC), June 8-13, Anaheim, CA, USA, pp 78–83
Ling N (2010) Expectations and challenges for next generation. In: Conference on industrial electronics and applications (ICIEA), 15-17 June 2010, Taichung, Taiwan, pp 2339–2344
Liu X, Shenoy P, Corner MD (2008) Chameleon: application-level power management. IEEE Trans Mobile Comput 7:995–1010
Liu A et al (2010) Just noticeable difference for images with decomposition model for separating
edge and textured regions. IEEE Trans Circuits Syst Video Technol 20:1648–1652
Lu X et al (2005) Fast mode decision and motion estimation for H.264 with a focus on MPEG-2/H.264 transcoding. In: International symposium on circuits and systems (ISCAS), 23-26 May 2005, Kobe, Japan, pp 1246–1249
Ma S, Gao W, Lu Y (2005) Rate-distortion analysis for H.264/AVC video coding and its applica-
tion to rate control. IEEE Trans Circuits Syst Video Technol 15:1533–1544
Marlow S, Ng J, McArdle C (1997) Efficient motion estimation using multiple log searching and adaptive search windows. In: International conference on image processing and its applications (IPA), July 1997, Dublin, Ireland, pp 214–218
McCann K et al (2012) Technical Evolution of the DTT Platform—an independent report by
ZetaCast, commissioned by Ofcom. http://stakeholders.ofcom.org.uk/binaries/consultations/
uhf-strategy/zetacast.pdf. Accessed Jan 2012
Meng B et al (2003) Efficient intra-prediction mode selection for 4×4 blocks in H.264. In: International conference on multimedia and expo (ICME), 6-9 July 2003, Baltimore, MD, USA, pp III-521–III-524
Mentor Graphics (2012) ModelSim—advanced simulation and debugging. http://model.com/
Merkle P et al (2007) Efficient prediction structures for multiview video coding. IEEE Trans
Circuits Syst Video Technol 17:1461–1473
Merkle P et al (2009) Stereo video compression for mobile 3D services. In: 3DTV conference, May 4-6, 2009, Potsdam, Germany, pp 1–4
Merritt L, Vanam R (2007) Improved rate control and motion estimation for H.264 encoder. In: IEEE international conference on image processing (ICIP), September 16-19, 2007, San Antonio, Texas, USA, pp V-309–V-312
Miano J (1999) Compressed image file formats: JPEG, PNG, GIF, XBM, BMP. ACM Press, Boston, MA
Mondal S, Ogrenci Memik S (2005) Fine-grain leakage optimization in SRAM based FPGAs. In:
ACM great lakes symposium on VLSI, Chicago, Illinois, USA, April 17-19, 2005,
pp 238–243
Morari M, Lee JH (1999) Model predictive control: past, present and future. Comput Chem Eng
23:667–682
Muller K et al (2005) 3-D reconstruction of a dynamic environment with a fully calibrated background for traffic scenes. IEEE Trans Circuits Syst Video Technol 15:538–549
Naccari M et al (2011) Low complexity deblocking filter perceptual optimization for the HEVC codec. In: International conference on image processing (ICIP), September 11-14, 2011, Brussels, Belgium, pp 737–740
Nintendo (2011) Nintendo 3DS. http://www.nintendo.com/3ds
Nvidia (2012a) Nvidia GeForce GTX 690. http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-690
Nvidia (2012b) Tegra 3 super processors. http://www.nvidia.com/object/tegra-3-processor.html
Nvidia Corp. (2012) Tegra 2 and Tegra 3 super processors. http://www.nvidia.com/object/tegra-3-processor.html
Oh K-J, Lee J, Park D-S (2011) Multi-view video coding based on high efficiency video coding.
In: Pacific Rim conference on advances in image and video technology, PSIVT 2011, November
2011, Gwangju, South Korea, pp 371–380
Ostermann J et al (2004) Video coding with H.264/AVC: tools, performance, and complexity.
IEEE Circuits Syst Mag 4(1st Quarter):7–28
Otero A et al (2010) Run-time scalable systolic coprocessors for flexible. In: International conference on field programmable logic and applications (FPL), August 31-September 2, 2010, Milan, Italy, pp 70–76
Ou T-S, Huang Y-H, Chen HH (2009) Efficient MB and prediction mode decisions for intra prediction of H.264 high profile. In: Picture coding symposium (PCS), 6-8 May, Chicago, IL, USA, pp 1–4
Ozbek N, Tekalp AM, Tunali ET (2007) Rate allocation between views in scalable stereo video coding using an objective stereo video quality measure. In: International conference on acoustics speech and signal processing (ICASSP), April 15-20, 2007, Honolulu, Hawaii, USA, pp 1045–1048
Pan F et al (2005) Fast mode decision algorithm for intraprediction in H.264/AVC video coding.
IEEE Trans Circuits Syst Video Technol 15:813–822
Panasonic (2011) HDC-SDT750K. http://www2.panasonic.com/consumer-electronics/support/Cameras-Camcorders/Camcorders/3D-CAMCORDERS/model.HDC-SDT750K
Panda PR, Dutt ND, Nicolau A (1997) Architectural exploration and optimization of local memory
in embedded systems. In: International symposium on system synthesis ISSS, September
17-19, 1997, Antwerp, Belgium, vol 10, pp 90–97
Park I, Capson DW (2008) Improved inter mode decision based on residue in H.264/AVC. In:
International conference on multimedia and expo (ICME), June 23-26 2008, Hannover,
Germany, pp 709–712
Park S, Sim D (2009) An efficient rate-control algorithm for multi-view video coding. In: IEEE international symposium on consumer electronics (ISCE), May 25-28, 2009, Kyoto, Japan, pp 115–118
Park JS, Song HJ (2006) Selective intra prediction mode decision for H.264/AVC encoders. Int J
Appl Sci Eng Technol 13:214–218
Payá-Vayá G et al (2010) VLIW architecture optimization for an efficient computation of stereo-
scopic video applications. In: International conference on green circuits and systems (ICGCS),
21-23 June, Shanghai, China, pp 457–462
Pei G et al (2002) FinFET design considerations based on 3-D simulation and analytical modeling. IEEE Trans Electron Devices 49:1411–1419
Peng Z et al (2008a) Fast macroblock mode selection algorithm for multiview video coding. EURASIP J Image Video Process 2008:393727. doi: 10.1155/2008/393727
Smolic A et al (2007) Coding algorithms for 3DTV—a survey. IEEE Trans Circuits Syst Video
Technol 17:1606–1621
Social Times (2011) Cisco predicts that 90% of all internet traffic will be video in the next three years. http://socialtimes.com/cisco-predicts-that-90-of-all-internet-traffic-will-be-video-in-the-next-three-years_b82819
Softpedia (2010) ARM wants a share of the server and desktop PC market by 2015. Softpedia. http://news.softpedia.com/newsImage/ARM-Wants-a-Share-of-the-Server-and-Desktop-PC-Market-by-2015-5.png/
Sony (2011) HDR-TD10—full HD 3D camcorder. http://www.sonystyle.com/webapp/wcs/stores/servlet/ProductDisplay?catalogId=10551&storeId=10151&langId=-1&productId=8198552921666294297
Stelmach LB, Tam WJ (1998) Stereoscopic image coding: effect of disparate image-quality in left- and right-eye views. Signal Process Image Commun 14:111–117
Stelmach LB, Tam WJ (1999) Stereoscopic image coding: effect of disparate image-quality in left- and right-eye views. Signal Process Image Commun 14:111–117
Stoykova E et al (2007) 3-D time-varying scene capture technologies: a survey. IEEE Trans
Circuits Syst Video Technol 17:1568–1586
Su Y, Vetro A, Smolic A (2006) Common test conditions for multiview video coding—Doc.
JVT-T207
Sullivan GJ, Ohm J-R (2010) Recent developments in standardization of high efficiency video coding (HEVC). In: Applications of digital image processing XXXIII, Proc SPIE 7798:77980V, September 7, 2010. doi: 10.1117/12.863486
Sullivan G, Wiegand T (1998) Rate-distortion optimization for video compression. IEEE Signal Process Mag 15:74–90
Sullivan GJ, Wiegand T (2005) Video compression—from concepts to the H.264/AVC standard.
Proc IEEE 93:18–31
Synopsys, Inc. (2012) IBM—65NM. http://www.synopsys.com/dw/emllselector.php?f=IBM&g=65
Tan TK, Sullivan G, Wedi T (2005) Recommended simulation conditions for coding efficiency experiments—VCEG-AA10. Nice, France
Tang X-L, Dai S-K, Cai C-H (2010) An analysis of TZSearch algorithm in JMVC. In: International conference on green circuits and systems (ICGCS), 21-23 June 2010, Shanghai, China, pp 516–520
Tanimoto M (2005) FTV (free viewpoint television) creating ray-based image engineering. In:
International conference on image processing (ICIP), Genoa, Italy, September 11-14, pp 25–28
Tatjewski P (2010) Supervisory predictive control and on-line set-point optimization. Int J Appl Math Comput Sci 20:483–495
Tech G et al (2010) Final report on coding algorithms for mobile 3DTV. MOBILE3DTV—Project No. 216503. http://sp.cs.tut.fi/mobile3dtv/results/tech/D2.6_Mobile3DTV_v1.0.pdf
Texas Instruments Inc. (2012) OMAP™ mobile processors: OMAP™ 5 platform. http://www.ti.com/general/docs/wtbu/wtbuproductcontent.tsp?templateId=6123&navigationId=12862&contentId=101230
The Digital Entertainment Group (2009) 3D White Paper
Tian L et al (2010) Analysis of quadratic R-D model in H.264/AVC video coding. In: 17th IEEE
international conference on image processing (ICIP), September 26-29, Hong Kong, China,
pp 2853–2856
Tourapis AM (2002) Enhanced predictive zonal search for single and multiple frame motion estimation. In: Visual communications and image processing conference (VCIP), January 2002, San Jose, CA
Tsai C-Y et al (2007) Low power cache algorithm and architecture design for fast motion estima-
tion in H.264/AVC encoder system. In: International conference on acoustics speech and signal
processing (ICASSP), April 15-20, 2007, Honolulu, Hawaii, USA, pp II-97–II-100
Tsung P-K et al (2009) Cache-based integer motion/disparity estimation for quad-HD H.264/AVC and HD multiview video coding. In: International conference on acoustics, speech and signal processing (ICASSP), 19-24 April 2009, Taipei, Taiwan, pp 2013–2016
Tuan T, Kao S, Trimberger S (2006) A 90nm low-power FPGA for battery-powered applications.
In: International symposium on field programmable gate arrays (FPGA), pp 3–11
Vimeo (2012) Vimeo 3D. http://vimeo.com/channels/stereoscopy
Vizzotto BB et al (2012) A model predictive controller for frame-level rate control in multiview
video coding. IEEE international conference on multimedia & expo (ICME’12), Melbourne,
Australia, July 9-13, 2012, pp 485–490
Wang W, Mishra P (2010) Leakage-aware energy minimization using dynamic voltage scaling and
cache reconfiguration in real-time systems. In: 23rd international conference on VLSI design,
3-7 Jan. 2010, Bangalore, India, vol 23, pp 357–372
Wang X et al (2007) Fast mode decision for H.264 video encoder based on MB motion character-
istic. In: International conference on multimedia and expo (ICME), July 2-5, 2007, Beijing,
China, pp 372–375
Wang S-H, Tai S-H, Chiang T (2009) A low-power and bandwidth-efficient motion estimation IP
core design using binary search. IEEE Trans Circuits Syst Video Technol 19:760–765
Wei Z, Ngan KN, Li H (2008) An efficient intra-mode selection algorithm for H.264 based on edge
classification and rate-distortion estimation. Signal Process Image Commun 23:699–710
Welch G et al (2005) Remote 3D medical consultation, vol 2, pp 1026–1033
Wen J et al (2010) ESVD: an integrated energy scalable framework for low-power video decoding
systems. EURASIP J Wirel Commun Netw 5:5:1–5:13
Wiegand T et al (2003) Overview of the H.264/AVC video coding standard. IEEE Trans Circuits
Syst Video Technol 13:560–576
Willner K et al (2008) Mobile 3D video using MVC and N800 internet tablet. In: 3DTV conference, 28-30 May 2008, Istanbul, Turkey
Woo J-H et al (2008) A 195 mW/152 mW mobile multimedia SoC with fully programmable 3-D
graphics and MPEG4/H.264/JPEG. IEEE J Solid-State Circuits 43:2047–2056
Wu C-Y, Su P-C (2009) A region of interest rate-control scheme for encoding traffic surveillance videos. In: International conference on intelligent information hiding and multimedia signal processing (IIH-MSP), September 12-14, 2009, Kyoto, Japan, pp 194–197
Wu D et al (2004) Block inter mode decision for fast encoding of H.264. In: International confer-
ence on acoustics speech and signal processing (ICASSP), May 17-21, 2004, Montreal,
Canada, vol iii, pp 181–184
Xilinx, Inc. (2012) ISE design suite. http://www.xilinx.com/products/design-tools/ise-design-
suite/index.htm
Xu X, He Y (2008) Fast disparity motion estimation in MVC based on range prediction. In: IEEE international conference on image processing (ICIP), October 12-15, 2008, San Diego, California, USA. IEEE, Piscataway, pp 2000–2003
Xu L et al (2011) Priority pyramid based bit allocation for multiview video coding. In: IEEE visual
communications and image processing (VCIP), Tainan, Taiwan, November 6-9, 2011, pp 1–4
Yamaoka M, Shinozaki Y, Maeda N, Shimazaki Y et al (2004) A 300-MHz 25-µA/Mb-leakage on-chip SRAM module featuring process-variation immunity and low-leakage-active mode for mobile-phone application processor. In: IEEE international solid-state circuits conference (ISSCC), February 9-11, 2004, San Francisco, CA, pp 494–542
Yamaoka M et al (2005) A 300-MHz 25-µA/Mb-leakage on-chip SRAM module featuring process-variation immunity and low-leakage-active mode for mobile-phone application processor. IEEE J Solid-State Circuits 40:186–194
Yan T et al (2009a) Frame-layer rate control algorithm for multi-view video coding. In: ACM/SIGEVO summit on genetic and evolutionary computation, June 12-14, 2009, Shanghai, China, pp 1025–1028
Yan T et al (2009b) Rate control algorithm for multi-view video coding based on correlation analysis. In: Symposium on photonics and optoelectronics (SOPO), August 23-25, 2009, Wuhan, China, pp 1–4
Yang J (2009) Multiview video coding based on rectified epipolar lines. In: International conference on information, communication and signal processing (ICICS), 7-10 December 2009, Macau, China, pp 1–5
Yang S, Wolf W, Vijaykrishnan N (2005) Power and performance analysis of motion estimation
based on hardware and software realizations. IEEE Trans Comput 54:714–726
YouTube (2011) YouTube—Broadcast yourself. http://www.youtube.com/
YouTube 3D (2011) YouTube—3D Channel. http://www.youtube.com/user/3D
Yu AC (2004) Efficient block-size selection algorithm for inter-frame coding in H.264/MPEG-4 AVC. In: International conference on acoustics, speech and signal processing (ICASSP), May 17-21, 2004, Montreal, Canada, pp III-169–III-172
Zatt B et al (2007) Memory hierarchy targeting bi-predictive motion compensation for H.264/AVC
decoder. In: IEEE computer society annual symposium on VLSI (ISVLSI), May 9-11, 2007,
Porto Alegre, Brazil, pp 445–446
Zatt B et al (2010) An adaptive early skip mode decision scheme for multiview video coding. In:
Picture coding symposium (PCS), Nagoya, Japan, 8-10 December, pp 42–45
Zatt B et al (2011a) A low-power memory architecture with application-aware power management for motion & disparity estimation in multiview video coding. In: IEEE/ACM 29th international conference on computer-aided design (ICCAD'11), November 7-11, 2011, San Jose, CA, USA, vol 29, pp 40–47
Zatt B et al (2011b) A multi-level dynamic complexity reduction scheme for multiview video cod-
ing. IEEE 18th international conference on image processing (ICIP’11), Brussels, Belgium,
September 11-14, 2011, vol 18, pp 761–764
Zatt B et al (2011c) Multi-level pipelined parallel hardware architecture for high throughput motion and disparity estimation in multiview video coding. In: IEEE/ACM 14th design automation and test in Europe conference (DATE'11), Grenoble, France, March 14-18, 2011, vol 14, pp 1448–1453
Zatt B et al (2011d) Run-time adaptive energy-aware motion and disparity estimation in
multiview video coding. In: ACM/IEEE/EDA 48th design automation conference (DAC’11),
San Diego, California, USA, June 5-10, pp 1026–1031
Zeng H, Ma K-K, Cai C (2011) Fast mode decision for multiview video coding using mode
correlation. IEEE Trans Circuits Syst Video Technol 21:1659–1666
Zhang K, Bhattacharya U, Chen Z, Hamzaoglu F, Murray D, Vallepalli N, Wang Y, Zheng B, Bohr M (2005) SRAM design on 65-nm CMOS technology with dynamic sleep transistor for leakage reduction. IEEE J Solid-State Circuits 40(4):895–901. doi: 10.1109/JSSC.2004.842846
Zhang Y et al (2009) ASIP approach for multimedia applications based on a scalable VLIW DSP
architecture. Tsinghua Sci Technol 14:126–132
Zhou Y et al (2011) PID-based bit allocation strategy for H.264/AVC rate control. IEEE Trans
Circuits Syst II Express Briefs 58:184–188
Zhu H, Luican II, Balasa F (2006) Memory size computation for multimedia processing applica-
tions. In: Asia and South Pacific conference on design automation, ASP-DAC 2006, Yokohama,
Japan, January 24-27, pp 802–807
Zone R (2007) Stereoscopic cinema and the origins of 3-D film, 1838–1952. University Press of Kentucky, Lexington, KY. ISBN 0813124611
Index
A
Address generation unit (AGU), 129–130, 134–135
Application-aware power gating, 9, 70, 128, 146–147
Application-specific integrated circuits (ASIC), 30, 32, 36, 155
Availability, 84

B
Ballroom sequence, 80, 142, 145–146, 159
Basic unit-level bitrate distribution, 86–87
Basic unit-level rate control
  MDP-based, 113
  MVC, 65
  for video-quality management
    coupled reinforcement learning, 121
    diagram, 119
    Markov decision process, 119–121
    regions of interest, 120
Benchmark video sequence, 152–155
Bitrate correlation analysis
  basic unit-level bitrate distribution, 86–87
  frame-level bitrate distribution, 86
  view-level bitrate distribution, 85

C
CES video analyzer tool, 185–187
Coding mode correlation analysis
  analysis, 75–76
  coding mode analysis summary, 80–81
  coding mode distribution analysis, 74–75
  RDCost analysis, 79–80
  video property analysis, 76, 78–79
Coupled reinforcement learning, 113, 121

D
3D Blu-Ray, 29
3D-cinema systems, 29
2D/3D digital video
  1D-array/2D-array, 12
  FTV system, 14
  HVS, 11
  macroblocks, 12
  multiview video sequence, 12–13
  RGB space, 11
  YUV space, 11
Decoded picture buffer (DPB), 59, 129
Disparity domain correlation, 15, 16, 19
3D multimedia processing
  3D-video pre- and post-processing, 172
  dynamic search window, 171
  energy reduction, 169
  fast ME/DE, 170–171
  frame and basic unit (BU) level, 170
  HRC algorithm, 170
  HVS, 170
  issues and challenges, 6–7
  MDP, 170
  mode decision algorithm, 169–170
  MPC, 170
  multibank video on-chip memory, 171
  MVC challenges, 172
  next-generation 3D-video coding, 172–173
  relax and aggressive strengths, 169–170
  requirements and trends, 3–5
  state of the art, 170–171
  video sequence, 171
3D-neighborhood correlation analysis
  bitrate correlation analysis
    basic unit-level bitrate distribution, 86–87

F
Fast motion and disparity estimation
  algorithm, 68–69, 107–109
  algorithm results, 109–111
  flow diagram, 108
  vs. TZ search, 109
Finite state machine (FSM), 131–133
Frame-level bitrate distribution, 86, 87, 124, 125
Frame-level rate control
  bitrate prediction, 114–115
  diagram, 115
  evaluation, 117–118
  model predictive controller, 113–114
  MPC-based, 113
  for quality related changes, 65
  quantization parameter definition, 117
  rate model, 116
Free-viewpoint television (FTV), 3, 14

H
Hierarchical rate control (HRC), 170
  MBEE, 162
  model predictive control-based rate control, 69
  for motion and disparity estimation, 170–171
  for MVC, 8, 111–113
  results
    BD-PSNR and BD-BR comparison, 123
    bitrate accuracy, 122
    bitrate and PSNR distribution, 124, 125
    bitrate distribution at BU level, 124–126
    view-level bitrate distribution, 124
Human visual system (HVS), 11, 170

J
Joint model for multiview video coding (JMVC)
  communication channels, 178–179
  CreateH264AVCEncoder class, 175–176
  encoder tracing, 178
  H264AVCEncoderTest, 175–176
  inter-frame search, 176–177
  ME/DE modification, 179
  mode decision modification, 179
  motion estimation, 176
  PicEncoder, 175–176
  rate control modification, 179–180
  SliceEncoder, 176

K
Key frames (KF)
  fast ME/DE algorithm scheduling, 138–139
  MVC encoder, parallelism in, 136

M
Markov decision process (MDP), 48–49, 113, 119–121, 170
Mean absolute differences (MAD), 45
Mean bit estimation error (MBEE), 162
Memory design methodology, 9
Mode decision (MD), 55
Model predictive control (MPC)
  based frame-level rate control, 46–47, 113, 170
  goal, 113
Monograph
  3D-neighborhood correlation analysis, 7–8
  energy-efficient hardware architecture, 9
  energy-efficient MVC algorithm, 8
  goal, 7
Motion and disparity estimation (ME/DE)
  Bi-prediction, 25–26
  design
    AGU, 134–135
    application-aware power-gating scheme, 128–129
    architectural template, 129–130
    DPB, 129
    dynamic search window formation algorithm, 128
    energy/complexity-aware control unit, 129
    multibank on-chip memory, 128
    on-chip video memory, 133–134
    programmable search control unit, 131–133
    SAD calculator, 130–131
  diamond search, 25
  enhanced predictive zonal search, 25
  full search, 24–25
  hardware architecture, 9
  log search, 25
  motion/disparity vector prediction, 26–27
  multiple block sizes, 26
  multiple reference frames and reference view, 26
  pipeline scheduling
    common vector predictor, 139–140
    fast operation modes, 139
    frame-level evaluation, 140
    generic search pattern, 137–138

Q
Quality-complexity classes (QCC)
  mode decision algorithm for
    pseudo-code of, 98
    RDCost characterization, 97–100
  types, 96–97
Quality states (QS), 99–100, 105
Quantization parameter (QP)
  based thresholding, 65
  definition, 117

R
Region of interest (RoI), 45, 49–50, 87, 119, 120
Rotating instruction set processing platform (RISPP), 31–32

S
Search pattern memory (SPM), 131–133
Sobel filter, 154–155
Spatial domain correlation, 14–15
Spatial–temporal–disparity index, 153–155
SRAM. See Static random access memory (SRAM)
State-of-the-art
  benchmark video sequence, 152–155
  vs. energy-efficient algorithm
    energy-aware complexity adaptation, 159–160
    fast ME/DE, 160–161
    mode decision, 156–159
  vs. energy-efficient hardware architecture, 163–166
  fairness of comparison, 155
  fast ME/DE algorithms, 43
  hardware description and ASIC synthesis, 155
  software simulation, 151–153
  vs. video quality control algorithm, 161–164

T
Temporal domain correlation, 15
Three-dimensional surveillance, 3
Three-dimensional telemedicine, 3
Three-dimensional telepresence, 3
Three-dimensional television (3DTV), 2
Three-dimensional video personal recording, 2
Thresholds
  probability density function, 87
  quantization parameter, 88
  RDCost property of SKIP MB, 88–89

V
Video-quality management
  basic unit-level rate control
    coupled reinforcement learning, 121
    diagram, 119
    Markov decision process, 119–121
    regions of interest, 120
  frame- and BU-level RC, 45–46
  frame-level rate control
    bitrate prediction, 114–115
    diagram, 115
    evaluation, 117–118
    model predictive controller, 113–114
    quantization parameter definition, 117
    rate model, 116
  hierarchical rate control, 111–113
  hierarchical rate control results, 121–126
  MAD, 45, 46
  MDP, 48–49
  MPC, 46–47
  pyramid-based priority structure, 45–46
  quantization parameters (QP), 45–46
  region of interest, 49–50
  reinforcement learning model, 49
  vs. state-of-the-art, 161–164
Video scaling, 4
View-level bitrate distribution, 85