Video-Based Rendering
Video-Based Rendering
Marcus A. Magnor
Max-Planck-Institut für Informatik
Saarbrücken, Germany
A K Peters
Wellesley, Massachusetts
A K Peters, Ltd.
888 Worcester Street, Suite 230
Wellesley, MA 02482
www.akpeters.com
Copyright © 2005 by A K Peters, Ltd.
All rights reserved. No part of the material protected by this copyright notice may
be reproduced or utilized in any form, electronic or mechanical, including photo-
copying, recording, or by any information storage and retrieval system, without
written permission from the copyright owner.
TK6680.5.M345 2005
006.6’96--dc22
2005050055
For Marian
Contents
Foreword ix
Preface xi
1 Introduction 1
2 Prerequisite Techniques 5
2.1 Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Geometry Reconstruction from Multiple Views . . . . . . . . . . 28
2.3 Image-Based Rendering . . . . . . . . . . . . . . . . . . . . . . . 36
2.4 Rendering on Graphics Hardware . . . . . . . . . . . . . . . . . 54
7 Outlook 161
Bibliography 171
Index 195
Foreword
If you have seen any of the Star Wars movies, you will certainly remember the
futuristic video technology in a galaxy far, far away. Video is projected into free
space, visible from any point of view. Objects and people appear to be present
in almost ghostlike fashion as three-dimensional entities. Princess Leia’s call for
help is recorded by R2-D2 for playback at the opportune time, whereas the brave
Jedi knights virtually attend meetings, zapping their three-dimensional video per-
sonalities in real time across the galaxy. We are in awe of such technological mar-
vel, willing to forgive that the grainy, low-resolution video looks slightly worse
than 1950s television.
It’s fun to ask what it would take to make this vision of the future a real-
ity. First, one needs to capture and record three-dimensional objects and scenes.
Then there is the issue of transmitting the data, presumably in real time, over
the inter-galactic equivalent of WiFi (let’s call it WiGa for now). And finally we
need a display that projects three-dimensional images into free space, such that a
viewer receives stereoscopic images anywhere in the viewing area without special
glasses. (As far as we know, even Darth Vader doesn’t have stereo glasses built
into his helmet.)
This book will hopefully convince you that much of this technology (apart
maybe from WiGa) is not as far-fetched as it seems. You will read about recording
people and stadium-sized scenes using large numbers of distributed cameras. A
viewer can pick arbitrary views and even virtually fly around the scene. Other
systems capture the light coming through a “window” in the scene using many
(maybe more than one hundred) closely arranged cameras. The captured light can
then be displayed on modern (even commercially available) screens that project
stereoscopic images to multiple viewers (without funny glasses). These scenarios
are known as interactive television, free-viewpoint video, and 3D television. They
hold great promise for the future of television, virtual reality, entertainment, and
visualization.
Why has this book been written now, and not ten years ago? To paraphrase
Intel’s Andy Grove, we are at an inflection point in the history of visual media.
Progress in computers, digital storage, transmission, and display technology con-
tinues to advance at unprecedented rates. All of this allows video to finally break
free from the confines of the two-dimensional moving picture. Advances in com-
puter vision, computer graphics, and displays allow us to add scene shape, novel
view synthesis, and real-time interactivity to video. This book will show you how
to do it.
If sales of computer games and the popularity of stereoscopic movies are any
indication, we are all fascinated with new forms of visual media that allow us
to get a better sense of presence. This book is a snapshot of what is possible
today. I am sure it will stimulate your imagination until, someday, we may watch
a three-dimensional rendition of Star Wars projected into our living rooms.
Preface
This book is about a new interdisciplinary research field that links comput-
er graphics, computer vision, and telecommunications. Video-based rendering
(VBR) has recently emerged from independent research in image-based render-
ing, motion capture, geometry reconstruction, and video processing. The goal
in VBR is to render authentic views of real-world–acquired dynamic scenes the
way they would appear from arbitrary vantage points. With this ambition, VBR
constitutes a key technology for new forms of visual communications such as
interactive television, 3D TV, and free-viewpoint video in which the viewer will
have the freedom to vary at will his or her perspective on movie scenes or sports
events. To computer game developers, VBR offers a way out of the continuously
increasing modeling costs that, driven by rapidly advancing hardware capabilities,
have become necessary in order to achieve ever more realistic rendering results.
Last but not least, virtual reality applications, such as training simulators, benefit
from VBR in that real-world events can be imported into virtual environments.
Video-based rendering research pushes forward two scientific frontiers in uni-
son: novel multi-video analysis algorithms are devised to robustly reconstruct
dynamic scene shape, and adequate rendering methods are developed that syn-
thesize realistic views of the scene at real-time frame rates. Both challenges,
multi-video analysis and novel view synthesis, must be attacked in concert if our
highly sophisticated visual sense is to be persuaded of the result’s realism.
Synopsis
After introducing the topic, Chapter 2 discusses a number of prerequisite tech-
niques that are related to or necessary for video-based rendering research. Chap-
ters 3 to 6 are devoted to specific VBR algorithms. Chapter 3 addresses VBR
approaches that rely on dense depth information. Time-critical VBR methods for
widely spaced video cameras are presented in Chapter 4. Chapter 5 describes two
approaches that make use of the continuous nature of any macroscopic motion.
Chapter 6 discusses a VBR method that exploits a priori knowledge about scene
content. The book closes with an outlook on future research and applications of
video-based rendering.
With the exception of Spatiotemporal View Interpolation, Section 5.1, all al-
gorithms are capable of rendering novel views at real-time frame rates on conven-
tional PC hardware. Other than that, the discussed methods differ with respect
to a number of criteria, e.g., online versus offline processing, small-baseline ver-
sus wide-baseline acquisition, implicit versus explicit geometry representation, or
silhouette- versus photo-consistent reconstruction. To quickly evaluate the useful-
ness of an individual VBR technique for a specific application, each algorithm is
briefly described and classified in Appendix B.
Acknowledgments
The VBR approaches described in this book have been proposed by various re-
searchers. I thank Chris Buehler, Paul Debevec, Neel Joshi, Wojciech Matusik,
Hartmut Schirmacher, Sundar Vedula, and Larry Zitnick for providing me with
additional information on their work, going over the respective sections, and for
allowing me to include their figures to better illustrate “their” VBR techniques.
Several algorithms discussed in this book have originated in the “Graphics–
Optics–Vision” research group at the Max-Planck-Institut für Informatik. They
are the work of Lukas Ahrenberg, Bastian Goldlücke, Ivo Ihrke, Ming Li, and
Christian Theobalt, excellent PhD students who can convert ideas to working
pieces of software, finding solutions to any theoretical or practical obstacle en-
countered along the way. Many figures in this book are reprints from our joint
publications. I am lucky to have such a great research crew.
As I found out, finalizing a book in time is real team work. I am indebted
to Gernot Ziegler and Michael Gösele who put a lot of effort into helping me
identify and re-phrase unclear passages in the original manuscript. I also thank
Alice Peters, Kevin Jackson-Mead and Sannie Sieper at A K Peters for their great
assistance and fast work. Finally, I wish to express my sincere gratitude to
Hans-Peter Seidel, my mentor without whom I would not have discovered this
fascinating and rewarding research field. It is to his persistent encouragement that
the existence of this book is owed.
Chapter 1
Introduction
The visual sense is the most sophisticated, most advanced sensory channel con-
necting the real world to human consciousness. Its online analysis capabilities
are enormous, and its reliability is proverbial. “Seeing is believing”: we are ac-
customed to unconditionally accepting as objective reality what our visual system
judges to be real.
It is the art of computer graphics rendering to “trick” the human visual system
into accepting virtual, computer-synthesized images as being genuine. From its
onset in the 1950s, impressive progress has been made in realistic image synthe-
sis. Continued algorithmic advances, in conjunction with ever-increasing com-
putational resources, have made possible ever more realistic rendering results.
Whether on modern graphics hardware or on ray-tracing clusters, time-critical
rendering applications today achieve a degree of realism that was unattainable
only a few years ago. From animated feature films and special movie effects
to computer games and training simulators, numerous new business fields have
emerged from realistic rendering research.
With increasing hardware and software capabilities, however, rendering re-
alism has become more and more dependent on the quality of the descriptions
of the scenes to be rendered. More and more time has to be invested into mod-
eling geometry, reflectance characteristics, scene illumination, and motion with
sufficient detail and precision.1 Increasingly, progress in realistic rendering is
in danger of being thwarted by the time-consuming modeling process.
In response to this dilemma, an alternative rendering concept has established
itself over the last decade: image-based modeling and rendering.2 In image-
1 The term model is used throughout the book to denote a description of a real-world scene at a
level of abstraction that is suitable for generating novel views from different (arbitrary) viewpoints.
2 See Shum et al. Image-Based Rendering. New York: Springer, 2005.
Figure 1.1. Image- and video-based rendering: From the description of a (time-varying)
scene in terms of 3D geometry, motion, surface reflectance, and illumination, computer
graphics rendering techniques are able to synthesize realistic images of the scene from
arbitrary perspectives. Computer vision addresses the much harder, and in the general
case even ill-posed, inverse problem of recovering the abstract scene description from real-
world image data. In image- and video-based rendering, both sides, analysis and synthesis,
are regarded collectively for best rendering results. To render a scene from arbitrary view-
points, but under identical lighting conditions, only geometry and object motion must be
reconstructed explicitly.
Chapter 2
Prerequisite Techniques
2.1 Acquisition
Video-based rendering begins with recording multiple video streams of real-world
scenes. Input data acquisition is a crucial part of VBR, as it determines attainable
rendering quality and, therefore, potentially also overall usefulness. To achieve
optimal results, the acquisition set-up must be adapted to scene content and appli-
cation scenario.
In image-based rendering, light fields of static scenes are customarily recorded
by using a single still-image camera. Multiple images are recorded in sequence
simply by moving the camera around the scene [169]. In contrast, acquiring the
time-varying light field of a dynamic scene requires recording the scene simul-
taneously from different vantage points. Instead of a single still-image camera,
multiple video cameras are necessary. In addition, all video cameras must be syn-
chronized in order to be able to relate image content across cameras later. Such
a multi-video acquisition system imposes challenging hardware demands that are
constrained further by financial and logistical issues.
Solid-State Imagers. For digital image acquisition, solid-state CCD [135] and
CMOS imaging chips are used almost exclusively today. In comparison, tradi-
tional cathode-ray tube (CRT) cameras have a number of inherent drawbacks,
such as non-linear response curve, inferior light sensitivity, higher electronic noise
level, larger size, and higher cost. Solid-state imagers are characterized by their
resolution, pixel area, light sensitivity, imaging noise, and quantum well capacity.
The size of one pixel, or more accurately the grid constant of the regular 2D pixel
array, ∆x, is referred to as pixel pitch; see Figure 2.1. Light sensitivity varies with
wavelength. For a given wavelength, chip sensitivity is specified by the conver-
sion efficiency. The conversion efficiency factor denotes the percentage of pho-
tons that are able to excite electrons in the valence band across the silicon’s band
gap to the conduction band. Imaging noise is fundamentally present due to the
temperature of the sensor. Additionally, the on-chip analog-to-digital converter
introduces noise to the signal. The quantum well capacity specifies the maximum
number of photons that can be accumulated per pixel. Together with the imaging
noise level, the quantum well capacity determines the useful dynamic range of
an imaging chip. In general, smaller pixels have a lower quantum well capacity
leading to a limited dynamic range of typically no more than eight bits. Due to
their different design, CMOS chips vary by fill factor, denoting the percentage
of light-sensitive area of each pixel in comparison to total chip area covered by
one pixel, while CCD chips differ in charge transfer efficiency, i.e., loss of signal
during image readout. While CMOS chips are cheaper to manufacture and easier
to control, so far, CCD chips are still superior with respect to imaging noise, light
sensitivity, and the fraction of “hot” (faulty) pixels. With the advent of mass-
produced CMOS-based webcams, however, CMOS imaging technology has been
constantly gaining ground on CCDs in recent years.
Commercial Cameras. Most VBR pioneering labs relied on the Sony DFW-
V500 camera [259, 208, 209]. At that time, it was about the only commercially
available digital color video camera that featured progressive scan mode and that
could be externally synchronized. In external synchronization mode, the Sony
camera achieves approximately 13 frames per second. It is fitted with a standard
C lens mount. The color images are recorded by a single-chip, 1/3-inch, 640 ×
480-pixel CCD chip using the Bayer mosaic technique [14]. All video cameras
used so far in VBR research record color using this approach. A Bayer mosaic
consists of a regular matrix of tiny red, green, and blue color filters right on top
of the CCD chip. Plate I depicts the filter arrangement. Two green, one red,
and one blue filter are arranged in a block of 2 × 2 pixels, following our visual
system’s emphasis on luminance information (approximately the green channel).
As a rule of thumb, the effective resolution of a Bayer mosaic camera corresponds
approximately to a full-color (e.g., 3-chip) camera with half as many pixels. On-
board the Sony DFW-V500, color information is automatically interpolated to
mimic full CCD-chip resolution. The camera outputs a YUV 4:2:2 digital video
signal, i.e., the luminance channel has (interpolated) 640 × 480-pixel resolution,
while the chrominance channels contain 320 × 240 pixels. Note that the output
signal requires 50 percent more bandwidth than was actually recorded by the chip.
The camera is connected to a host computer via the IEEE1394 (FireWireTM ) bus.
The bus bandwidth of 400 Mbit/s allows two cameras to be controlled from
one host PC. The camera model features some additional on-board preprocessing,
such as gain control and white-balance adjustment. Different sets of parameter
values can be stored in camera memory for fast switching.
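To illustrate the Bayer-mosaic interpolation mentioned above, the following short Python sketch performs simple bilinear demosaicing. The RGGB layout and the normalized-convolution scheme are assumptions made purely for illustration; the interpolation actually performed on-board the Sony camera is not documented here.

import numpy as np
from scipy.ndimage import convolve

def demosaic_bilinear(raw):
    # Bilinear demosaicing of an RGGB Bayer image (H x W, float).
    # Assumed layout per 2x2 block:  R G
    #                                G B
    h, w = raw.shape
    r_mask = np.zeros((h, w), bool); r_mask[0::2, 0::2] = True
    b_mask = np.zeros((h, w), bool); b_mask[1::2, 1::2] = True
    g_mask = ~(r_mask | b_mask)

    rgb = np.zeros((h, w, 3))
    k = np.array([[0.25, 0.5, 0.25], [0.5, 1.0, 0.5], [0.25, 0.5, 0.25]])
    for c, mask in enumerate([r_mask, g_mask, b_mask]):
        chan = np.where(mask, raw, 0.0)
        weight = convolve(mask.astype(float), k, mode='mirror')
        rgb[..., c] = convolve(chan, k, mode='mirror') / weight   # normalized convolution
    return rgb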
In recent years, other manufacturers have built externally triggerable video
cameras, some of which have already been employed in VBR research. The 3D
TV system [207] is based on the Basler A101fc camera model. It records 12
frames per second at 1300 × 1030-pixel resolution on a 2/3-inch CCD chip. The
camera can output raw pixel data, so transmission bandwidth of the FireWireTM
bus is used efficiently, and the conversion from Bayer mosaic data to, e.g., RGB
color space can be controlled. The Basler models also feature the standard C lens
mount.
The acquisition set-up of Microsoft’s Video View Interpolation project [345]
consists of PointGrey cameras that deliver 1024 × 768-pixel color images at 15
frames per second [130]. The video images are downstreamed via fiber optics
cables to a custom-built storage system.
The use of CMOS imagers instead of CCD chips was pioneered in the “Stan-
ford Multi-Camera Array” [312]. The system consists of custom-built camera
heads using the Omnivision OV8610 CMOS imaging chip. The camera heads
record 640 × 480-pixel color images in YUV 4:2:2 format at 30 frames per sec-
ond. One control board per camera preprocesses the images and encodes the video
data as MPEG-2 bitstreams. Up to 26 compressed video streams are downlinked
via the IEEE1394 bus to one PC, where the bitstreams are stored on SCSI hard
drives. As the Stanford Multi-Camera Array provides MPEG-compressed im-
age data only, quantization noise and blocking artifacts deteriorate image quality,
posing a challenge to further processing algorithms.
Table 2.1 summarizes the technical specifications of different camera mod-
els used so far in VBR research. Of course, new camera models with improved
capabilities will continue to become available. The amount of data that can be
processed using conventional PC technology, however, puts a limit on manage-
able acquisition resolution, frame rate, and color depth. With consumer-market
PC memory, hard drive capacity, and CPU clock rate apparently levelling out in
recent years, it may be surmised that VBR recording resolution and frame rate
will not increase by orders of magnitude in the coming years from the level of the
latest camera models in Table 2.1.
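To make the data-rate argument concrete, a small back-of-the-envelope sketch follows; the eight-camera, 1024 x 768, 15 fps figures are chosen for illustration and are not quoted from Table 2.1.

def raw_data_rate_mb(n_cams, width, height, bytes_per_pixel, fps):
    # Aggregate uncompressed video data rate in MB/s.
    return n_cams * width * height * bytes_per_pixel * fps / 1e6

# Eight cameras delivering 1024 x 768 RGB (3 bytes/pixel) at 15 fps:
print(raw_data_rate_mb(8, 1024, 768, 3, 15))   # approx. 283 MB/s of raw video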
The Lens. Besides camera electronics, it is the lens that determines recorded
image quality. No amount of processing on an image taken through an inferior
lens can achieve the image quality of an appropriate, first-grade lens objective.
The standard lens mounts for video cameras are the so-called C- and CS-
mounts for which a large palette of lenses are commercially available. To achieve
the best possible recording results, the appropriate lens must be selected for the
Figure 2.1. A camera lens is characterized by its focal length f and its maximal free
aperture D, or minimum f-stop number F = f /D. To achieve the best recording results,
lens parameters must be matched to pixel pitch (pixel dimension) ∆x and overall imaging
chip extent N ∆x.
camera chip’s specifications. Figure 2.1 illustrates the most important parameters
of the lens-chip imaging system.
To avoid vignetting, the lens must be matched to the size of the camera’s CCD
sensor to evenly illuminate the entire CCD chip area N ∆x. Typical video sen-
sor sizes are 1/3-inch, 1/2-inch, and 2/3-inch, measured along the chip diagonal.
Video camera lenses are specifically designed for these different chip sizes, with
prices increasing from the small to the large illumination field. Lenses for larger
sensors can be used in conjunction with smaller CCD chips, but not vice versa.
To make full use of camera resolution, the lens must produce images that are
“as sharp” as can be imaged by the camera. In optical terms, the lens’s point
spread function (PSF) should approximately match the size of one pixel. For a
good lens, its Airy disk diameter A depends on wavelength λ and f-stop number F :
A = 2.44 λ F . (2.1)
The f-stop number, or f-stop, denotes the ratio of focal length f to free aperture
diameter D of the lens, F = f /D. For green light (λ = 500 nm) and F = 2.0, a
first-grade lens concentrates most intensity into a disc of A = 2.44 µm in diame-
ter. Following the Rayleigh criterion, the smallest resolvable distance corresponds
to the case when two point light sources are imaged A/2 apart. To record at max-
imum resolution, the sampling theorem dictates that at least two pixels must fall
within this length. In summary, the largest permissible pixel dimension without
sacrificing any theoretically attainable resolution is ∆x = A/4. The f-stop of
a lens objective can typically be varied to adjust for different light levels, so the
Airy disk diameter increases with increasing f-stop number. If a lens is stopped
down too much, its resolution can become worse than the pixel resolution of the
camera, which would be a waste of available imaging resources. CCD pixel size
∆x therefore limits the range of reasonable f-stops: the larger the CCD pixels,
the larger the f-stop number may be. On the other hand, if the Airy disk A is
smaller than about four times the pixel size ∆x, the recorded image is not ad-
equately prefiltered, and aliasing artifacts occur. While the above argument is
based on a single wavelength, the lens must also be corrected for chromatic aber-
rations to record sharp color images, which is the main reason excellent camera
lenses are expensive.
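As a quick numerical check of Equation (2.1) and the ∆x = A/4 sampling argument, the following sketch simply reproduces the values quoted in the text:

def airy_diameter_um(wavelength_um, f_stop):
    # Equation (2.1): A = 2.44 * lambda * F
    return 2.44 * wavelength_um * f_stop

A = airy_diameter_um(0.5, 2.0)   # green light, F = 2.0  ->  A = 2.44 um
dx_max = A / 4.0                 # largest pixel pitch without resolution loss -> 0.61 um
print(A, dx_max)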
The knife-edge technique constitutes a practical and convenient way to mea-
sure the effective resolution of a lens-sensor system [248]. It consists of recording
the image of a straight edge that is oriented at a slightly slanted angle with respect
to the CCD chip orientation. The slope of the imaged edge is determined, and all
image scan lines are registered into a single, super-resolution edge profile. This
edge profile is also referred to as the line spread function. The Fourier transform
of the line spread function is the optical system’s modulation transfer function
(MTF). It captures the imaging system’s response to (linear) spatial frequency.
The resolution limit of an optical system is conventionally said to be the fre-
quency at which the MTF has fallen to about 10 percent of its DC (maximum)
value [314].
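A minimal sketch of the knife-edge evaluation follows. It assumes that the slanted-edge scan lines have already been registered into a single super-resolved edge profile (the preceding step described in the text) and is not the implementation of [248].

import numpy as np

def mtf_from_edge_profile(esf, dx):
    # Differentiate the super-resolved edge-spread function to obtain the
    # line-spread function, then Fourier-transform it to get the MTF.
    # esf: 1D edge profile sampled at spacing dx (e.g., pixel pitch / 4).
    lsf = np.diff(esf)                       # line spread function
    lsf = lsf / lsf.sum()                    # normalize so that MTF(0) = 1
    mtf = np.abs(np.fft.rfft(lsf))
    freqs = np.fft.rfftfreq(lsf.size, d=dx)  # cycles per unit length
    return freqs, mtf

def resolution_limit(freqs, mtf, threshold=0.10):
    # Conventional resolution limit: first frequency where the MTF drops to 10% of DC.
    below = np.where(mtf < threshold * mtf[0])[0]
    return freqs[below[0]] if below.size else freqs[-1]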
Besides controlling the amount of light that falls onto the image sensor and
its effect on imaging resolution, the aperture stop (f-stop) has another important
influence on imaging quality. It determines the depth of field, i.e., the depth range
within which the scene is captured at full camera resolution. The depth range
towards, ∆min , and away from the camera, ∆max , that is still imaged sharply de-
pends on pixel size ∆x, focal length f , f-stop F , and distance d:
∆min = ∆x F d² / (f² + ∆x F d) ,
∆max = ∆x F d² / (f² − ∆x F d) .
For example, a camera with pixel size ∆x = 5µm that is equipped with a lens
featuring f = 15 mm focal length, focused at d = 3 m distance, and stopped down
to F = 4.0 captures the depth range from d − ∆min = 3m − 0.63m = 2.37m to
d + ∆max = 3m + 1.09m = 4.09m at full resolution. For ∆x = 5µm, F = 4.0
is the largest f-stop number before sacrificing camera resolution at λ = 500 nm.
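The depth-of-field formulas translate directly into a short sketch that reproduces the numbers of the example above (all lengths in metres):

def depth_of_field(dx, f, F, d):
    # Near/far sharp depth range around the focus distance d.
    # dx: pixel pitch, f: focal length, F: f-stop.
    near = dx * F * d**2 / (f**2 + dx * F * d)
    far  = dx * F * d**2 / (f**2 - dx * F * d)
    return near, far

near, far = depth_of_field(5e-6, 15e-3, 4.0, 3.0)
print(3.0 - near, 3.0 + far)   # ~2.37 m and ~4.09 m, as in the text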
Some video objective lenses do not allow varying the focus. Instead, the focus
is set by the manufacturer at the hyperfocal distance dhf corresponding to some
f-stop F and pixel size ∆x:
dhf = f² / (∆x F) .    (2.2)
Such fixed-focus lenses give sharp images of objects at distances from about dhf /2
to infinity. However, because the correct hyperfocal distance depends on CCD
pixel size ∆x, the lens must be specifically manufactured for that camera model.
Lenses with adjustable focal distance offer a wider range of applications, e.g.,
they can also be used for close-up recordings. On the other hand, such lenses
must be individually focused, which is a non-trivial task and can turn out to be
a tedious chore for recording scenarios with many cameras. Also, if the focus
changes between recordings, e.g., due to different room temperature or acciden-
tal touching, the cameras must be refocused, and potentially even re-calibrated
(Section 2.1.5), before each recording session.
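For the illustrative camera used above (∆x = 5 µm, f = 15 mm, F = 4.0), Equation (2.2) gives a hyperfocal distance of 11.25 m, so a fixed-focus lens set this way is sharp from roughly 5.6 m to infinity; a one-line sketch:

def hyperfocal_m(f, dx, F):
    # Equation (2.2): d_hf = f^2 / (dx * F), lengths in metres.
    return f**2 / (dx * F)

d_hf = hyperfocal_m(15e-3, 5e-6, 4.0)
print(d_hf, d_hf / 2.0)   # 11.25 m; sharp from about 5.6 m to infinity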
The focal length f of the camera lens and overall CCD chip size N ∆x deter-
mine the imaged viewing angle,
α = 2 arctan ( N ∆x / (2f) ) .
Typically, maximum distance from the scene, and therefore necessary viewing
angle, is determined by the size of the recording studio. Wide-angle lenses offer-
ing angles larger than 40◦ to 50◦ potentially exhibit substantial radial distortion,
which must be corrected for prior to VBR processing. Normal and telephoto
lenses show less distortion, which makes camera calibration considerably easier.
Zoom lenses with variable focal length are not recommended for VBR applica-
tions, simply because the cameras must be recalibrated whenever the focal length
is changed.
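A small numeric illustration of the viewing-angle relation; the 4.8 mm horizontal chip extent assumed here is a typical value for a 1/3-inch sensor, not one quoted in the text.

import math

def viewing_angle_deg(chip_extent_m, f_m):
    # alpha = 2 * arctan( N * dx / (2 f) )
    return math.degrees(2.0 * math.atan(chip_extent_m / (2.0 * f_m)))

print(viewing_angle_deg(4.8e-3, 15e-3))   # about 18 degrees for f = 15 mm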
One issue to keep in mind is whether the imaging chip is already equipped
with an infrared-blocking (IR cut) filter. The silicon chip is highly sensitive to in-
frared radiation that is invisible to the human eye. If infrared light is not blocked,
the images depict wrong object colors, and color calibration is not possible. If
the camera is sensitive to IR, an infrared-blocking filter must be mounted onto the
camera lens.
Figure 2.2. The Stanford Multi-Camera Array consists of one hundred CMOS camera
heads. The cameras can be arranged in a planar matrix with aligned optical axes to record
dynamic scenes in the two-plane light field parameterization. (Reprinted from [103].)
Light field images that are not recorded with parallel optical camera axes can be rectified
after acquisition by applying a projective transform to each camera image. Steve
Seitz describes how to determine the projection transformation from point corre-
spondences, as well as how to warp and resample the light field images [263, 262].
Suitably implemented, e.g., on graphics hardware, image rectification can be car-
ried out on-the-fly at acquisition frame rates [332].
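As a rough illustration of such a rectification step, the following sketch warps one image by a 3 × 3 projective transform using OpenCV. The homography H is assumed to have been estimated beforehand, e.g., from point correspondences, and this CPU code is only a stand-in for the GPU implementation referenced above.

import cv2
import numpy as np

def rectify(image, H, out_size):
    # Re-sample the image onto the common rectified image plane.
    return cv2.warpPerspective(image, np.asarray(H, dtype=np.float64), out_size,
                               flags=cv2.INTER_LINEAR)

# H could come from cv2.findHomography(src_pts, dst_pts); out_size is (width, height).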
• The cameras’ depth of field can cover the entire scene volume.
In the confines of a studio room, maximum distance from the scene is limited.
Recording the scene from above can prove impossible if the ceiling height is that
of a normal office room. On the other hand, limited ceiling height allows the use
of telescope poles [200]. The poles can be jammed between the floor and ceiling
at arbitrary positions in the room to serve as mounting rigs for the cameras; see
Figure 2.3. This way, utmost camera positioning flexibility is achieved while
providing sufficient stability and alignment robustness, e.g., against accidental
bumps. To clamp the cameras to the telescope poles and to orient the cameras
in arbitrary directions, mounting brackets with three-degrees-of-freedom camera
heads can be used [200].
Figure 2.3. Camera mounting: the photo industry offers telescope poles that can be
jammed between floor and ceiling, offering great flexibility in camera positioning. To affix
the cameras to the poles, freely orientable mounting brackets are used. ((left) Reprinted
from [291]. (middle, right) Reprinted from [176], courtesy of Ming Li.)
Mobile Set-Up. A studio room restricts the type of dynamic events that can
be acquired to indoor activities only. To record multi-video footage of events in
their natural environment, a multi-video recording system is needed that can be
folded up, transported to arbitrary locations, and set up almost anywhere without
too much effort. This requires a compact and adaptable system for synchronized
multi-video acquisition.
Figure 2.4. For a mobile multi-video acquisition system, each camera module must run
autonomously. Built from consumer-market components, each unit consists of a laptop,
camera, tripod, and battery pack. The complete module weighs less than 10 kg and fits
into a backpack for convenient transportation. (Reprinted from [3], © 2004 IEE.)
Figure 2.5. Modular acquisition system design: the laptop-camera units are controlled
and synchronized via wireless LAN. (Reprinted from [3], © 2004 IEE.)
A prototype of a mobile multi-video acquisition system has been built by
Lukas Ahrenberg [3]. The system is composed of several autonomous, portable
laptop-camera modules that capture and store video data independent of any fur-
ther infrastructure; see Figure 2.4. The modules are built from off-the-shelf com-
ponents only. Each module consists of a laptop PC with wireless LAN capabili-
ties, a Sony DFW-V500 video camera, a 12V rechargeable battery, a FireWireTM
card adapter, and a tripod.
A number of issues arise when adapting the equipment for use in a stand-alone
system. One such challenge is the power supply for the camera. The Sony camera
model draws its current from the FireWireTM cable. Laptop FireWireTM cards,
however, do not exhibit the 12V output line that PC FireWireTM cards feature.
In addition, the camera’s energy consumption of about 4 watts would quickly
drain the laptop battery and severely limit operation time. So instead, a Card-Bus
adapter is inserted in the laptop’s FireWireTM port that enables feeding an external
power source to the FireWireTM cable. While the rechargeable 12V battery adds
extra pounds to the module, the total weight of the complete unit is still less than
10 kg.
To synchronize multi-video acquisition, the camera modules are not physi-
cally connected by cable. Instead, a radio-based wireless network is used; see
Figure 2.5. The 802.11 wireless local area network (WLAN) is run in ad-hoc
mode, so that additional camera modules can be added or removed without the
need to reconfigure the existing network. A client-server architecture is estab-
lished among the laptops. The master laptop broadcasts instructions to all client
modules via WLAN to remotely control the entire system. Due to limited avail-
ability of the drivers for the laptop’s hardware components, the system is imple-
mented using the non-real-time operating system Windows 2000.
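The control protocol of the mobile system is not spelled out here; purely as an illustration of the master/client idea, a minimal UDP-broadcast sketch follows (the port number and command format are assumptions):

import socket

CONTROL_PORT = 9999          # assumed port, for illustration only

def broadcast_command(cmd: bytes):
    # Master laptop: broadcast a control command (e.g., b'START') to all client modules.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    s.sendto(cmd, ('<broadcast>', CONTROL_PORT))
    s.close()

def wait_for_command():
    # Client module: block until the next command arrives, then act on it.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(('', CONTROL_PORT))
    cmd, _ = s.recvfrom(1024)
    s.close()
    return cmd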
2.1.3 Illumination
It is well known in photography that scene illumination has a strong impact on
object appearance. In television studios, theaters, or at movie sets, scene lighting
is the task of specialists. Remarkably, illumination has not yet been given serious
consideration in VBR acquisition, even though VBR imposes strict requirements
on scene lighting to ensure optimal acquisition:
• Intensity: scene foreground must be brightly lit to allow for short exposure
times, which may be necessary to avoid motion blur. A brightly lit scene is
also desirable to realize small internal video camera gain settings in order
to keep electronic noise to a minimum while exploiting the cameras’ full
dynamic range. Also, high illumination levels allow for higher f-numbers,
which increases the depth of field.
• Placement: the light sources must be placed such that they do not cause
lens glare in any camera.
• Natural spectrum: the lights must lend the scene its natural color appear-
ance. Strong infrared emitters, such as incandescent light sources, or stan-
dard fluorescent tube light can falsify captured object colors.
Obviously, a large studio room with high ceilings is preferable because then light
sources and acquisition cameras can be positioned at the periphery. Light sources
are characterized by their output light intensity (luminous flux) and their color
rendering index (CRI). The color rendering index ranges from 0 to 100 and in-
dicates how well colors are reproduced by different illumination conditions in
comparison to a standard light source (e.g., daylight). A CRI value of 100 repre-
sents identical visual appearance, while cool white fluorescent lamps may have a
CRI value of as little as 60.
Different types of light sources are available for stage lighting. Arc lamps as
well as tungsten-filament and halide light bulbs are based on thermal emission,
creating a continuous, broad spectrum. Unfortunately, such incandescent light
sources radiate excessive quantities of infrared light while emitting less energy in
the green and blue part of the spectrum. To avoid yellow-tinted object appearance
in the images, white balancing can be achieved by mounting appropriate filters
in front of the lights. Conventional fluorescent lamps emit “cold” light, such that
the red end of the spectrum is not suitably represented. Full-spectrum fluorescent
lamps are available; however, fluorescent tube lamps typically flicker with the
frequency of the power grid. Solid-state lights (LEDs, etc.) are just emerging
as light sources for stage illumination applications. Halide metal vapor lamps
(HMI) combine high illumination intensities with high CRI values. In an HMI
light bulb, a DC electric arc burns in an atmosphere of mercury intermixed with
halides of rare earth elements. HMI lamps exhibit a spectral energy distribution
that resembles natural sunlight, including a substantial ultraviolet component that
should be blocked via filtering. HMI lamps can deliver more than 10^5 lumens.
Because the light-emitting area within the HMI bulb is confined to a small spot, a
diffuser screen must be mounted in front of the light bulb to avoid sharp shadow
lines.
Figure 2.6. Client-server studio set-up: the cameras are connected to host PCs and syn-
chronized via triggering cables. The PCs communicate with the server via EthernetTM .
(Reprinted from [291], © 2003 Eurographics.)
on SCSI hard drives to store the video streams in compressed format [312]. A
custom multi-drive system using fiber optical connectors was built by PointGrey
Research Inc. for the Video View Interpolation system [345]. Storing uncom-
pressed, high-resolution video streams using conventional PC hardware remains
a challenge for VBR acquisition.
For online processing, data must be exchanged between client PCs and the dis-
play host. To keep the system scalable, as much computational load as possible
must be delegated from the display host to the client PCs. A distributed client-
server network facilitates on-line processing of the recorded multi-video imagery.
On the lowest level, each client PC controls, e.g., two synchronized video cam-
eras. Image resolution permitting, each client PC is able to perform low-level
processing on both incoming video streams at acquisition frame rate. In a hierar-
chical network, the processed data from every two clients are sent on to one PC
that processes the combined data further. The result is passed on to yet another
computer in the next-higher hierarchical layer of processing nodes, until at the
root of the hierarchical network the complete, processed information is available.
For n video cameras, such a binary tree network requires n − 1 computers. The
root node PC is the display server and renders the time-critical output image. At
the same time, it is able to control the network. To limit network traffic for video
texture transfer, the display server requests only those images from the respective
camera clients that contribute to the target view. This way, network traffic and
computational load are kept constant to accommodate arbitrarily many acquisition
cameras. For VBR algorithms that can be parallelized to benefit from distributed
processing, interactive performance is achieved using commodity PC hardware in
conjunction with a standard 100 Mb EthernetTM LAN [208, 259, 172, 207].
All time-critical VBR algorithms discussed in Chapter 4 are experimentally
evaluated using a distributed set-up as shown in Figure 2.6. The network consists
of four 1.1 GHz Athlon client PCs and one P4 1.7 GHz display server. Four client
nodes each control two Sony DFW-V500 cameras via the IEEE1394 FireWireTM
bus. Camera resolution is 320 × 240 pixels RGB, acquired at 13 fps. The client
nodes are directly connected to the display server via 100 Mb EthernetTM . While
different VBR algorithms may require different graphics boards on the display
server, the network infrastructure remains the same for all experiments.
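A short sketch of the bandwidth budget behind this set-up, for the worst case in which both camera streams of a client are actually requested:

def tree_nodes(n_cameras):
    # Binary processing tree over n cameras (two cameras per leaf client): n - 1 PCs.
    return n_cameras - 1

def per_client_rate_mbit(width, height, bytes_per_pixel, fps, cams_per_client=2):
    # Uncompressed upstream data rate per client PC in Mbit/s.
    return width * height * bytes_per_pixel * fps * 8 * cams_per_client / 1e6

print(tree_nodes(8))                          # 7 PCs for the eight-camera set-up
print(per_client_rate_mbit(320, 240, 3, 13))  # ~48 Mbit/s, within a 100 Mb Ethernet link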
Mobile Set-Up Calibration. For accurate results, the extrinsic calibration ob-
ject should cover the entire visible scene. In a studio, the size of the stage is
preset, and a suitable calibration object can be custom-built. For a portable sys-
tem, however, camera arrangement and recorded scene size is different during
every deployment. A robust, quick, and easy-to-use camera calibration proce-
dure that performs well for arbitrary camera configurations and in a wide range
of recording environments is desirable. For such applications, the use of a virtual
calibration object has proven useful. A virtual calibration object is constructed
by sweeping an easy-to-localize object, e.g., a flashlight or a brightly colored ball,
all around the scene volume [11]. A spherical marker object has the advantage
that it also works in daylight, it has the same appearance from all directions, and
its midpoint can be robustly determined from its image silhouette. If recorded in
sync, video images of the same time frame depict the calibration object at a dis-
tinct 3D position in space. This 3D position serves as one calibration point. Chen
et al. extend the virtual calibration object idea to the case of multiple unsynchro-
nized cameras [44]. In contrast to a rigid calibration object, however, absolute 3D
Figure 2.7. Graph-based extrinsic camera calibration: (a) example for a graph with non-
empty sets S, Sk , S̄, and (b) the corresponding camera configuration. (Reprinted from
[125].)
Figure 2.8. Top view of a reconstructed virtual calibration object (line segments) and the
recovered camera calibration (frustum pyramids).
calibration point positions are not known, which leads to an indeterminate global
scale factor for the camera set-up.
The virtual calibration concept does not require all cameras to see the calibra-
tion marker object simultaneously. The only requirement regarding the marker
path is that the marker should be visible in at least two cameras at any time, and
that every camera should record the marker sometime during the calibration se-
quence.
Relative position and orientation between camera pairs can be determined di-
rectly from pairwise image correspondences. To register all cameras into a com-
mon reference frame, a graph-based visualization is useful [125].
The graph G = (V, E) reflects the relationship between cameras; see Fig-
ure 2.7(a). Graph nodes represent the cameras Ci ∈ V , while relative position and
orientation between pairs of cameras (Ci , Cj ) ∈ E are represented by a connect-
ing edge in the graph; see Figure 2.7(b). The graph is undirected since the relative
position and orientation (R, t) for camera pair (Ci , Cj ) automatically yields the
relative position and orientation of (Cj , Ci ) via (R^T , −R^T t). Once the graph is
constructed, it is searched for cyclic subsets Sk ⊆ S. The set S̄ := S \ (∪k Sk )
consists of all cameras that do not belong to any cycle.
First, the unknown scale factors for pairwise translation are determined in-
dependently for each Sk . Any cycle in the graph represents a closed curve in
three-dimensional space. By exploiting the constraint that the scaled translations
along any closed loop of the graph must sum to zero, a linear system of equa-
tions is set up for each cycle. The resulting linear system of equations is typically
over-determined and can be solved in a least-squares sense. Since the obtained
estimate is not directly related to the image-space measurements anymore, the re-
projection error of the solution is not necessarily optimal. The overall shape of
the set-up is, nevertheless, correctly recovered, and the result can be used as the
initial estimate for subsequent bundle adjustment also to optimize camera orien-
tation parameters [296, 109]. Using these partial registrations, parts of the virtual
calibration object (e.g., the path of the tracked marker object) are reconstructed.
The remaining step is to register all subsets Sk and all Ci ∈ S̄ with each other
to form one globally consistent calibration. Figure 2.8 depicts the reconstructed
object marker path and estimated camera positions and orientations from a real-
world calibration sequence.
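To make the cycle constraint more tangible, here is a small sketch of the scale-recovery step. It is not the implementation of [125], and it assumes that all pairwise translation directions have already been rotated into one common reference frame.

import numpy as np

def solve_cycle_scales(edges, cycles):
    # edges : dict (i, j) -> relative translation direction t_ij (3-vector),
    #         expressed in a common reference frame (assumed given).
    # cycles: list of closed camera loops, e.g., [1, 2, 3, 1].
    # The scaled translations along each loop must sum to zero; stacking these
    # constraints gives a homogeneous system solved up to one global scale.
    edge_index = {e: k for k, e in enumerate(edges)}
    blocks = []
    for cycle in cycles:
        block = np.zeros((3, len(edges)))
        for a, b in zip(cycle[:-1], cycle[1:]):
            if (a, b) in edge_index:                 # edge traversed forwards
                block[:, edge_index[(a, b)]] += edges[(a, b)]
            else:                                    # traversed against its stored direction
                block[:, edge_index[(b, a)]] -= edges[(b, a)]
        blocks.append(block)
    A = np.vstack(blocks)
    _, _, vt = np.linalg.svd(A)                      # least-squares null vector
    scales = vt[-1]
    return scales / np.abs(scales).max()             # overall scale stays indeterminate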
suming that the exposure times are exact, the pixel values measured for short and
long exposures are mapped to the extrapolated linear response curve determined
from the mid-exposure times. Three look-up tables are established that map any
recorded red-, green-, and blue-channel value to the full, linear imaging range,
e.g., from 0 to 255 pixel intensity units.
To make use of all color patches, the radiometric falloff over the Macbeth
color chart due to lens vignetting and non-uniform illumination must be accounted
for. A large photographic gray card is placed in front of the Macbeth chart, cov-
ering the entire color chart. For each pixel, a scaling factor is computed from the
gray-card image to correct the color-chart recording.
To minimize difference in color reproduction between cameras, a 3 × 4 color
calibration matrix M_i^colcal is determined for each camera i. The matrices are com-
puted by taking into account all 24 color patches j. The color chart is recorded
at the exposure time used later during VBR acquisition. For each color patch,
the recorded and falloff-corrected pixel values (rij , gij , bij ) are averaged over all
cameras i to yield average color-patch values (r̄j , ḡj , b̄j ). The color calibration
matrices are used to best match recorded color-patch pixel values to the average
color-patch value:
( r̄j , ḡj , b̄j )^T = M_i^colcal ( rij , gij , bij , 1 )^T .    (2.3)
cameras that are, to first order, similar in color, Neel Joshi and his co-workers
advise doing only the first calibration step, placing the white target close to the
cameras and recording out-of-focus [140]. If this is not feasible, they suggest
setting internal camera gain and offset to the same values for all cameras instead of
relying on the cameras’ automatic gain and white balance feature.
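A minimal sketch of fitting the matrices of Equation (2.3) in a least-squares sense follows; the array shapes are assumptions, and the solver actually used in the original work is not specified here.

import numpy as np

def color_calibration_matrix(recorded, average):
    # recorded: (24, 3) falloff-corrected RGB patch values of one camera.
    # average : (24, 3) per-patch values averaged over all cameras.
    X = np.hstack([recorded, np.ones((recorded.shape[0], 1))])   # homogeneous, (24, 4)
    M_T, *_ = np.linalg.lstsq(X, average, rcond=None)            # solve X M^T ~ average
    return M_T.T                                                 # the 3 x 4 matrix of Eq. (2.3)

def apply_color_calibration(M, rgb):
    # Map one recorded RGB triple through the camera's calibration matrix.
    return M @ np.append(rgb, 1.0)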
2.1.7 Segmentation
Image pixel affiliation with either scene background or object foreground is a
valuable clue for geometry reconstruction. Silhouette-based methods rely entirely
on accurate image segmentation. To distinguish between fore- and background
pixels from image data only, segmentation must be based on pixel color.
In professional applications, segmentation is traditionally realized via
chroma-keying. The foreground is thereby recorded in front of a uniformly col-
ored backdrop whose color does not resemble any color occurring on the fore-
ground object. Typical backdrop colors are blue or green [309]. By labeling as
background all pixels that are similar in color to the backdrop, the foreground
can be quickly and robustly segmented. This simple method can be realized even
with analog technology and is used for live broadcasts. (For example, the nightly
TV weather forecast is commonly produced by recording the meteorologist in
front of a uniform backdrop and compositing his/her color-segmented silhouette
in real-time with the computer-generated weather chart.)
Particularly in small recording rooms, however, indirect illumination causes
the background color of the walls and floor to reflect off the foreground object,
resulting in an unnatural color tint of the object as well as segmentation inaccu-
racies. This effect is commonly referred to as color bleeding. Shadows cast by
the foreground onto the background are another potential source of segmentation
error. To avoid both problems, instead of using blue or green as the backdrop
color, the studio walls and floor can be covered with black molleton and carpet;
see Figure 2.9. This way, shadows become invisible, and no background color
can reflect off the foreground. On the other hand, illumination conditions must be
adjusted for the missing indirect illumination component from the walls and floor,
and the segmentation algorithm has to account for dark or black regions that can
also occur on the foreground object.
Retro-reflective fabric material is becoming popular to cover background
walls and floor. Blue or green ring lights mounted on all camera lenses strongly
reflect back from the material into the cameras, such that in the recorded video
images, the background stands out in a single color. Scene foreground remains
conventionally illuminated with bright (white) light sources. Because the retro-
reflective background requires only low ring light intensities, color bleeding ef-
Figure 2.9. A recording studio, equipped with black curtains and carpet for robust image
segmentation. (Reprinted from [291], © 2003 Eurographics.)
Figure 2.10. Clean-plate segmentation: with the help of pre-acquired background im-
ages, (a) the dynamic foreground object can (b) be segmented based on statistical color
differences. (Reprinted from [176], courtesy of Ming Li.)
fects are avoided while shadows cast on the background from foreground lighting
drown in the retro-reflected, uni-color ring light.
If the background cannot, or is not supposed to, be manipulated, segmentation
must be based on comparing the recorded images to previously acquired images
of the background [17]. These additional clean plates depict the static background
and are recorded with the same cameras and from the identical positions as the
dynamic sequence; see Figure 2.10. Electronic noise causes the background pixel
values to vary slightly, especially in low-light conditions. To determine the aver-
age color and variance per pixel, a number of background images are recorded.
Mean color and variance are then used to compute two threshold values per pixel
that are used to decide whether the pixel belongs to the foreground or the back-
ground [46]. If the pixel value is between both threshold values, no clear decision
can be made, which is still much more useful than a bogus classification.
In the dynamic image sequences, background appearance may not be com-
pletely static due to shadows cast by the foreground. From the observation that
scene regions do not, in general, change in color appearance when shadowed,
pixel classification is based on pixel hue instead of intensity level to obtain more
robust segmentation results. Because scene background is not controlled, simi-
lar colors can coincide in the foreground and the background, typically causing
holes in the silhouettes. The segmented silhouettes are therefore post-processed
employing a morphological dilate/erode operation [134]. Suitably implemented,
background subtraction and morphological filtering can be done on-the-fly during
multi-video acquisition.
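A compact sketch of clean-plate segmentation along these lines follows; the threshold factors and the per-channel handling are illustrative choices, not the values from [46].

import numpy as np
from scipy import ndimage

def background_statistics(clean_plates):
    # Per-pixel mean and standard deviation from a stack of background images.
    stack = np.asarray(clean_plates, dtype=np.float32)
    return stack.mean(axis=0), stack.std(axis=0)

def segment_foreground(frame, mean, std, k_low=3.0, k_high=6.0):
    # Label pixels as background (0), foreground (1), or undecided (0.5)
    # depending on their deviation from the background statistics.
    diff = np.abs(frame.astype(np.float32) - mean).max(axis=-1)   # strongest color channel
    sigma = std.max(axis=-1) + 1e-3
    label = np.full(diff.shape, 0.5, dtype=np.float32)
    label[diff < k_low * sigma] = 0.0
    label[diff > k_high * sigma] = 1.0
    # Morphological clean-up of the binary silhouette (dilate/erode):
    silhouette = ndimage.binary_closing(label == 1.0, structure=np.ones((3, 3)))
    return silhouette, label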
Plate V depicts eight segmented reference images for one time frame. This is
the input data to all wide-baseline VBR algorithms discussed in Chapters 4 to 6.
envelops the object’s true geometry, it cannot recover local concavities because
concave regions do not affect the object silhouette.
Another constraint that is crucial for all but silhouette-based reconstruction
approaches is diffuse reflectance. Diffusely reflecting, i.e., Lambertian, sur-
faces exhibit the beneficial property that their appearance remains constant when
viewed under arbitrary orientation, as long as the illumination is static. This ap-
pearance invariance allows the establishing of correspondences between different
views. Fortunately, most materials in nature exhibit predominantly diffuse re-
flectance properties. Photo-consistent reconstruction approaches are based on the
assumption of inter-image similarity in pixel color and intensity. To decrease the
probability of false matches, color similarity is frequently measured not per pixel
but for a small patch of pixels. As similarity measures, the sum of absolute dif-
ferences (SAD), the sum of squared differences (SSD), or the normalized cross-
correlation (NCC) between rectangular image patches are commonly used. The
NCC measure (see Equation (3.2)) is the most robust, as it is invariant to differ-
ences in overall brightness. However, none of these low-level similarity measures
are invariant to projective transformation or rotation [109].
From known image recording positions, pixel correspondences directly yield
the 3D positions of scene surface points. But even for diffusely reflecting ob-
jects, photo-consistency remains ambiguous if surfaces show no color variation.
Reliable correspondences can be established only at discontinuities. For uniform
image regions, use can be made of the regularizing observation that constant im-
age regions typically also correspond to smooth object surfaces. The flat-shaded
regions are then plausibly interpolated from reliably established 3D points.
While silhouette-based approaches inherently yield only approximations to
the true geometry of general 3D shapes, photo-consistent approaches assume
diffuse reflectance and must rely on regularization. Accordingly, reconstructed
geometry cannot be expected to be exact. When considering a reconstruction
technique for a given camera set-up, two scenarios must be distinguished. If the
input images are recorded from nearby viewpoints such that they exhibit high sim-
ilarity, depth-from-stereo algorithms that aim at assigning a depth value to each
image pixel are applicable. For images recorded from widely spaced positions,
shape-from-silhouette techniques as well as photo-hull reconstruction methods,
both of which recover a full 3D model, can be applied. In the following, the
different reconstruction approaches will be described in more detail.
able that depict the scene from similar perspectives, such that any point in the
scene is ideally visible in two or more images. For metric reconstruction, the
recording cameras must be calibrated so that the images can be related geometri-
cally [109]. Figure 2.11(a) illustrates the epipolar recording geometry for a pair
of stereo cameras C1,2 . A 3D scene point P visible in both images I1,2 estab-
lishes a correspondence between two pixels P1,2 from either image. In reverse,
a correspondence between two pixels from both images determines the 3D posi-
tion of the scene point. A pixel P1 in image I1 can have a corresponding pixel in
image I2 only along the respective epipolar line l2 , and vice versa. The epipolar
line l2 is the projection of the viewing ray from the center of projection of cam-
era C1 through P into image I2 , so the search space for pixel correspondences is
confined to the respective one-dimensional epipolar line.
To establish correspondence between pixels, image regions around the pixels
are commonly compared using sum of squared differences (SSD), sum of absolute
differences (SAD), or normalized cross-correlation (NCC) as similarity measures.
The normalized cross-correlation is generally more robust than SSD or SAD. The
NCC between two image regions I1 (Cp ) and I2 (Cq ) is defined by
C(p, q) = Σi ( I1 (pi ) − Ī1 (Cp ) ) ( I2 (qi ) − Ī2 (Cq ) ) / sqrt( Σi ( I1 (pi ) − Ī1 (Cp ) )² · Σi ( I2 (qi ) − Ī2 (Cq ) )² ) ,    (2.4)
where pi , qi denote region pixels, and Ī1 (Cp ), Ī2 (Cq ) are the mean pixel values
of each region.
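Transcribed directly, Equation (2.4) becomes a few lines of code; the patches are assumed to be equally sized gray-value arrays.

import numpy as np

def ncc(patch1, patch2):
    # Normalized cross-correlation of two image regions, Equation (2.4).
    a = patch1.astype(np.float64) - patch1.mean()
    b = patch2.astype(np.float64) - patch2.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0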
For accelerated performance, stereo image pairs can be rectified prior to image
region comparison; see Figure 2.11(b). A projective transform takes both stereo
images I1,2 to a common image plane and aligns epipolar lines in parallel to image scanlines.
Figure 2.11. Stereo reconstruction: (a) The 3D coordinates of a scene point P can be
triangulated from its 2D projection coordinates P1,2 in two non-parallel image planes I1,2 .
(b) For faster block-matching, stereo image pairs can be rectified such that epipolar lines
correspond to image scanlines. (Reprinted from [176], courtesy of Ming Li.)