A Hitchhiker's Guide to Virtual Reality
Karen McMenemy
Stuart Ferguson
A K Peters, Ltd.
Wellesley, Massachusetts
Cover image: The front cover illustrates a virtual figure exploring a virtual landscape. The
theory and practical implementation making this possible are described in Chapters 5, 7,
13 and 14. The inset images on the back cover illustrate the projection system designed in
Chapter 4 and some of the practical projects of Chapter 18.
To Cathy
Contents

Preface xiii

1 Introduction 3
1.1 What is Virtual Reality? 3
1.2 A Quick Tour of the Book 4
1.3 Assumed Knowledge and Target Audience 7
1.4 Valuable Resources for VR 8
1.5 The Code Listings 9
Bibliography 10

2 The Human Senses and VR 13

3 Applications and Implications of VR 29
3.1 Entertainment 30
3.2 Visualization 31
3.3 Training 33
3.4 Education 35
3.5 Other Applications 36
3.6 Distributed VR 38
3.7 The Implications of VR 39
3.8 Summary 42
Bibliography 43

8 Computer Vision in VR 181
8.1 The Mathematical Language of Geometric Computer Vision 183
8.2 Cameras 185
8.3 A Brief Look at Some Advanced Ideas in Computer Vision 189
8.4 Software Libraries for Computer Vision 214
8.5 Summary 216
Bibliography 217

9 Image-Based Rendering 219
9.1 General Approaches to IBR 220
9.2 Acquiring Images for IBR 223
9.3 Mosaicing and Making Panoramic Images 224
9.4 Summary 230
Bibliography 231

10 Stereopsis 235
10.1 Parallax 237
10.2 Head-Mounted Displays 246
10.3 Active, Passive and Other Stereoscopic Systems 249
10.4 Summary 256
Bibliography 256

11 Navigation and Movement in VR 257
11.1 Computer Animation 258
11.2 Moving and Rotating in 3D 260
11.3 Robotic Motion 266
11.4 Inverse Kinematics 271
11.5 Summary 281
Bibliography 282

Summing Up 283

II Practical Programs for VR 287

15 Using Multimedia in VR 385
15.1 A Brief Overview of DirectShow 386
15.2 Methodology 391
15.3 A Movie Player 393
15.4 Video Sources 401
15.5 Custom Filters and Utility Programs 406
15.6 Playing Movies or Live Video into Textures 413
15.7 Summary 431
Bibliography 431

16 Programming Stereopsis 433
16.1 Display Adapter Hardware for Stereopsis 434
16.2 File Formats for Storing Stereoscopic Images 438
16.3 A File Format for Stereo Movies 441

Index 585
Preface
The seeds of the ideas that have become this book were sown on the breakfast
terrace of a hotel in Beverly Hills during SIGGRAPH 2005. Whilst preparing
for another interesting day at the conference, we faced the immediate task of
how to journey downtown, no mean task in LA. So as we turned to a small
guidebook for advice, the thought struck us that it would be really useful if we
had a similar informative and practical guide to virtual reality (VR). At that
time, we were preparing to build the next phase of our own VR research lab,
and to be able to do that, it isn't enough just to know about the background
theory and technology that might be useful. We needed practical solutions:
hardware and software solutions. Not finding anything suitable, we decided
to compile our own: a guide that would bring together under one cover all
the aspects of graphics, video, audio and haptics that have to work together
to make virtual reality a reality.
Part I forms the core of the book. We examine the human senses and
their significance in delivering a sense of reality within the virtual world. We
describe the types of interface technologies that are available and how they
work. After reading the first few chapters, it should be evident that being able
to see the virtual world is of prime importance. So several chapters in Part I
are focused on all aspects of how we interface with the virtual world through
our sense of sight.
Part II of the book is in a different format. Titled Practical Programs for
VR, the text is tightly integrated with the CD. A wide spectrum of example
programs for use in practical VR work are described and explained in Part II.
The examples complement and make concrete the concepts covered in Part I.
We have found them exceptionally useful, and many are in use in our own
VR laboratory. The programs are written in the C or C++ languages and are
targeted for the Windows PC platform.
As we stressed before, the key element of VR is the visual one, and here,
perhaps more than in any other aspect of the human-computer interface,
software plays the dominant role. We show you how to use the main 3D
rendering libraries in interactive real-time applications. Real-time 3D is not
the only source of content for VR applications; movies, DVDs and other
multimedia material can also play a vital role. We show you how to use
Microsoft's DirectX technology to achieve this with minimal effort.
The level and sophistication of interaction set many VR applications apart
from computer games and computer-generated movies. Programming this
interaction poses its own complexities. The keyboard and mouse may be all very
well for many things, but VR needs more: two-handed input with multiple
degrees of freedom, for example. We offer some examples of how to use the
USB PC interface and the input components of DirectX. Part II also discusses
the challenges that a programmer faces when working with haptic devices.
You'll find more specific detail about the book's content in Chapter 1, but
we hope your curiosity is sufficiently excited to read on. We found our guide
very useful and enjoyed writing it; we hope you find it useful too and enjoy
reading it.
Acknowledgments
Many individuals have given us much-valued ideas, feedback and other assistance in preparing the book. Others have inspired a more general interest in
this fascinating and engrossing subject; to all of you, many thanks.
Introduction
1.1 What is Virtual Reality?
Before we take a little tour of the topics and chapters in the remainder of the
book it is worth pausing briefly to consider what we mean by virtual reality.
First, it can also go by several other names. Augmented reality, spatial
reality, mixed reality, collaborative reality and even tangible reality are all used
1.2 A Quick Tour of the Book
The book is in two parts, and you could read them almost independently.
The CD is an integral part of the book: its contents relate primarily to the
programs discussed in the second part, and you will find readme files with
additional details about the programs that we just don't have space to put in
the book.
We have tried to follow a logical progression: what VR fundamentally
aims to achieve, to what it can be applied, what elements need to be brought
together, how they work and how the theory is turned into the practice.
We also consider some key concepts from the allied disciplines of computer
graphics and computer vision which overlap with VR.
Part I is comprised of Chapters 1 through 11 and forms the core of the
book. After a brief introduction in Chapter 1, we move to Chapter 2 where
we examine the human senses and their significance in delivering a sense of
reality within the virtual world. The chapter also explains why some senses
are easier to interface with the virtual world than others.
Chapter 3 reviews a spectrum of applications where VR has proven to
be an invaluable tool. It discusses in general terms the benefits to be gained
in using VR, as well as the downside. Users of VR may be affected in ways
that they are unaware of, and there could be wider implications which are not
always obvious.
In Chapter 4, we take our first steps into the practical world of VR by
describing the types of interface technologies that are available and how they
work. We show how different types of human interface devices are appropriate
for different applications of VR. To really integrate the real with the
virtual, we must also be able to match positions and directions in the user's
reality with the computer's corresponding virtual reality. This brings in the
human interface again, but here it is more a case of the virtual system sensing
and responding to real events rather than being commanded by direct and
explicit input, or a command to generate a view from such-and-such a place
and in such-and-such a direction. As such, Chapter 4 also includes a section
on the technology of motion tracking and capture. (Haptic is the term in
common use for discussing all aspects of being able to touch and feel things.
We will return to it several times.)
After reading the first few chapters, it should be evident that being able to
see the virtual world is of prime importance. It is hard to imagine VR without
its 3D computer-generated graphics. Thanks to the incredible popularity of
3D computer games, the technology for visualization has achieved undreamt-of degrees of realism, and a major aspect of our book explains how this is
achieved. Chapter 5 starts our discussion of visualization by explaining how
the virtual world may be specified in ways the computer can understand and
use to create a visual input to our senses. After a brief review of some basic
mathematics for 3D graphics in Chapter 6, Chapter 7 explains the essence of
the algorithms that are used to create the real-time views.
In these chapters we explore the basic principles of the two key elements
to VR we identified earlier. However, before turning to consider how to
program the computer to deliver a VR experience, there are a few more aspects regarding visualization that are worth exploring further. The science of
computer vision is not normally associated with VR, but many of the things
that computer vision makes possible (such as using a virtual environment described by pictures rather than 3D numerical data) open up some interesting
applications of VR. Consequently, we feel a basic introduction to this often
impenetrable mathematical subject should form part of a study of VR, and
we include one in Chapter 8. We extend this discussion in Chapter 9 by
introducing the concept of image-based rendering, which offers a powerful
alternative to traditional geometry-based techniques for creating images.
In the penultimate chapter of Part I, we explore in detail the subject of
stereopsis. If you don't already know what it is, you will learn all about it. If
you do know what it is, you will already appreciate the essential enhancement
it brings to any VR experience.
Of course, even generating a 3D movie in real time is not enough for VR;
we must be able to generate the real-time view interactively. We must be able
to move about the virtual world, navigate through it and choose the direction
in which we look or the objects we look at. We examine the relevant aspects
of this movement in Chapter 11.
Part I of the book draws to a close with a brief summary of the most
significant aspects of VR introduced thus far and offers an opinion on the
future for VR. It identifies the most difficult challenges in the short term
and ponders just how far it may be possible to go in making a virtual reality
indistinguishable from reality.
Part II of the book is in a different format. Titled Practical Programs
for VR, the text is tightly integrated with the accompanying CD, where all
the code for each application project is available. The text is concise. Most
of the programs, especially the projects, have much fuller explanations embedded as comments in the source code than we have space to print. The
programs are designed to cover the spectrum of applications needed to get a
VR environment up and running using only a basic personal computer and
the minimum of additional hardware. The examples complement and make
concrete the concepts covered in Part I. We have found them exceptionally
useful, and many are in use in our own VR laboratory. The programs are written in the C/C++ language and are targeted for the Windows PC platform.
We hope that even if you have only a basic (no pun intended) knowledge of
programming, you should still be able to modify the user-interface elements
and appearance. Chapter 12 introduces the tools, libraries and programming
conventions we use throughout Part II.
As we stressed before, the key element of VR is the visual one, and here,
perhaps more than in any other aspect of the human-computer interface, software plays the dominant role. In Chapter 13, we show you how to use the
main 3D rendering libraries in interactive real-time applications. We pursue
this in Chapter 14, where we show how to take advantage of the programmability of the graphics hardware so as to significantly increase the realism of the
virtual scenes we want to experience.
Real-time 3D is not the only source of content for VR applications;
movies, DVDs and other multimedia material can also play a vital role. Real-time video input, sometimes mixed with 3D material, is another important
source. Trying to build application programs to accommodate the huge variety of source formats is a daunting challenge. In Chapter 15, we show you
how to use Microsoft's DirectX technology to achieve this with minimal effort. This chapter also provides examples of how to combine video sources
with 3D shapes and content, and leads into Chapter 16, which focuses on
programming for stereopsis.
The level and sophistication of interaction set many VR applications apart
from computer games and computer-generated movies. Programming this
interaction poses its own complexities. The keyboard and mouse may be
very good for many things, but VR needs more: two-handed input with
multiple degrees of freedom, for example. In Chapter 17, we offer some
examples of how to use the USB PC interface and the input components of
DirectX to lift restrictions on the type and formats of input. The chapter also
explains the challenges that a programmer faces when working with haptic
devices.
In our final chapter (Chapter 18), we bring together all the elements to
form a series of projects that illustrate the concepts of our book.
1.3 Assumed Knowledge and Target Audience
This book covers a lot of ground: there is some mathematics, a little electronic
engineering, a bit of science and quite a lot of programming. We try not
to assume that the reader is a specialist in any of these areas. The material
in Chapters 2 through 11 could be used as the text for a first course in VR.
For this we only assume that the reader is familiar with the concepts of the
vector, has a basic knowledge of coordinate geometry and has studied science
or engineering to high-school level. The chapter on computer vision may
require a little more confidence in using vectors and matrices and a bit of
calculus, but it should be useful to postgraduate students embarking on a
research degree that needs to use computer vision in VR.
The second part of the book is intended for a mix of readers. It is basically about programming and writing computer programs to get a job done.
We assume the reader is familiar with the ideas in Part I. To get the most out
of this part, you should know the C/C++ programming language and preferably have a bit of experience writing applications for Windows. However,
in Chapter 12 we summarize the key Windows programming concepts. We
have used some of this material in our masters degree program, and it should
also fit well with a final-year degree course, based on the material in Part I as
a prerequisite.
In Part II, there are a lot of practically useful programs and pieces of code
that research engineers or scientists might find useful. Part II could also offer
a starting point for people who wish to put together their own experimental
VR environment at low cost. We've used these programs to build one of our
own.
1.4 Valuable Resources for VR
1.5 The Code Listings
All the application programs that we will examine in Part II use an overall
skeleton structure that is approximately the same. Each application will require some minor modifications or additions to each of the functions listed.
We will not reprint duplicate code, nor every line of code. Generally,
we will refer to additions and specific lines and fragments if they are significant. Using our annotated printed listings, it should be possible to gain an
understanding of the program logic and follow the execution path. Using the
readme files and comments in the code, it should be well within your grasp to
adapt the complete set of application programs on the CD.
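To give a flavor of what we mean by a skeleton, the fragment below is a minimal Win32 sketch of the general shape such programs take: register a window class, create a window, run a message loop, and hang the project-specific work off a handful of functions. The Init, Draw and Cleanup names here are illustrative placeholders only; the actual skeleton and its functions are the ones printed in Part II and supplied on the CD.

// A minimal sketch (not the book's own template) of a Win32 application
// skeleton. Project-specific code would go in the three placeholder functions.
#include <windows.h>

static void Init(HWND hwnd)    { /* hypothetical: create rendering context etc. */ }
static void Draw(HWND hwnd)    { /* hypothetical: per-frame drawing */ }
static void Cleanup(HWND hwnd) { /* hypothetical: release resources */ }

static LRESULT CALLBACK WndProc(HWND hwnd, UINT msg, WPARAM wp, LPARAM lp) {
    switch (msg) {
    case WM_CREATE:  Init(hwnd);    return 0;
    case WM_PAINT: { PAINTSTRUCT ps; BeginPaint(hwnd, &ps); Draw(hwnd);
                     EndPaint(hwnd, &ps); return 0; }
    case WM_DESTROY: Cleanup(hwnd); PostQuitMessage(0); return 0;
    }
    return DefWindowProc(hwnd, msg, wp, lp);
}

int WINAPI WinMain(HINSTANCE hInst, HINSTANCE, LPSTR, int) {
    WNDCLASS wc = {0};
    wc.lpfnWndProc   = WndProc;
    wc.hInstance     = hInst;
    wc.lpszClassName = TEXT("SkeletonApp");
    RegisterClass(&wc);
    HWND hwnd = CreateWindow(wc.lpszClassName, TEXT("VR Skeleton"),
                             WS_OVERLAPPEDWINDOW | WS_VISIBLE,
                             CW_USEDEFAULT, CW_USEDEFAULT, 640, 480,
                             NULL, NULL, hInst, NULL);
    if (!hwnd) return 0;
    MSG msg;                                     // standard Windows message loop
    while (GetMessage(&msg, NULL, 0, 0) > 0) {
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }
    return (int)msg.wParam;
}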
Bibliography
[1] ARToolKit. http://www.hitl.washington.edu/artoolkit/.
[2] O. Bimber, R. Raskar. Spatial Augmented Reality: Merging Real and Virtual
Worlds. Wellesley, MA: A K Peters, 2005.
[3] D. A. Bowman, E. Kruijff, J. J. LaViola and I. Poupyrev. 3D User Interfaces:
Theory and Practice. Boston, MA: Addison-Wesley, 2005.
[4] P. Burger and D. Gillies. Interactive Computer Graphics. Reading, MA: Addison-Wesley, 1989.
[5] R. S. Ferguson. Practical Algorithms for 3D Computer Graphics. Natick, MA:
A K Peters, 2001.
[6] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision.
Cambridge, UK: Cambridge University Press, 2003.
[7] IEEE Computer Graphics and Applications Journal: IEEE Computer Society.
[8] Intel Corp. OpenCV: Open Source Computer Vision Library. http://www.intel.com/technology/computing/opencv/.
[9] F. Luna. Introduction to 3D Game Programming with DirectX 9.0. Plano, TX:
Wordware Publishing Inc., 2002.
[10] T. Möller and E. Haines. Real-Time Rendering. Natick, MA: A K Peters, 1999.
[11] J. Neider, T. Davis and M. Woo. OpenGL Programming Guide, Fifth Edition.
Reading, MA: Addison-Wesley, 2006.
[12] M. Nixon and A. Aguado. Feature Extraction and Image Processing. Burlington,
MA: Elsevier, 2005.
[13] The Open Inventor Group. The Inventor Mentor: Programming Object-Oriented
3D Graphics with OpenInventor, Release 2 (OTL). Reading, MA: Addison-Wesley, 1994.
[14] J. R. Parker. Algorithms for Image Processing and Computer Vision. New York:
John Wiley and Sons, 1997.
[15] M. Pesce. Programming Microsoft DirectShow for Digital Video and Television.
Redmond, WA: Microsoft Press, 2003.
[16] B. E. Rector and J. M. Newcomer. Win32 Programming. Reading, MA:
Addison-Wesley, 1997.
[17] R. Rost. OpenGL Shading Language, Fifth Edition. Reading, MA: Addison-Wesley, 2004.
The Human Senses and VR
The human body is a beautiful piece of engineering that has five major senses
which allow it to gather information about the world surrounding it. These
five senses were first defined by Aristotle to be sight, hearing, touch, smell
and taste. Whilst these five major senses have been recognized for some time,
recently other senses have been named, such as pain, balance, movement and
so on. Probably the most interesting of these is the perception of body awareness. This essentially refers to an awareness of where one's body parts are at
any one time. For example, if a person closes her eyes and moves her arm
then she will be aware of where her hands are, even though they are not being
detected by other senses.
The senses receive information from outside and inside the body. This
information must then be interpreted by the human brain. The process of
receiving information via the senses and then interpreting the information
via the brain is known as perception.
When creating a virtual world, it is important to be able to emulate the
human perception process; that is, the key to VR is trying to trick or deceive
the human perceptual system into believing that the user is part of the virtual world.
Thus a perfect feeling of immersion in a virtual world means that the major
senses have to be stimulated, including of course an awareness of where we are
within the virtual world. To do this, we need to substitute the real information received by these senses with artificially generated information. In this
manner, we can replace the real world with the virtual one. The impression
of being present in this virtual world is called virtual reality.
Any VR system must stimulate the key human senses in a realistic way.
The most important senses by which humans receive information about the
world surrounding them are vision, hearing, and touch. As such, these are
2.1
2.1.1 The Retina
The retina is the light-sensitive layer at the back of the eye. It is often thought
of as being analogous to the sensor in a camera that converts light energy
to electrical pulses. However, its function goes beyond just creating these
pulses.
Photosensitive cells called rods and cones in the retina convert incidental light energy into signals that are carried to the brain by the optic nerve.
Roughly 126 million rods and cones are intermingled non-uniformly over
the retina. Approximately 120 million of these are rods, which are exceedingly sensitive. They respond in light too dim for the cones to respond to, yet
are unable to distinguish color, and the images they relay are not well defined.
These rods are responsible for night and peripheral vision. In contrast, the
ensemble of 6 million cones performs in bright conditions, giving detailed
colored views. Experiments have shown that among the cones, there are three
different types of color receptors. Response curves for these three cones correspond roughly to red, green and blue detectors. As such, it is convenient to
use the red-green-blue (RGB) primary color model for computer graphics and
2.1.2 Depth Perception
Figure 2.1. Shadows give a sense of the relative position between objects.
2.1.3
2.2
Our ears form the most obvious part of our auditory system, guiding sound
waves into the auditory canal.
The canal itself enhances the sounds we hear and directs them to the
eardrum.
The eardrum converts the sound waves into mechanical vibrations.
In the middle ear, three tiny bones amplify slight sounds by a factor of
30.
The inner ear changes the mechanical vibrations into electrochemical
signals which can then be processed by the brain.
The electrochemical signals are sent to the brain via the auditory nerve.
Humans have three different types of sound cues. Interaural time difference is a measure of the difference in time between when a sound enters our
left ear and when it enters our right ear. Interaural intensity difference is a measure of how a sound's intensity level drops off with distance. Acoustic shadow
is the effect of higher frequency sounds being blocked by objects between the
sound's source and us. Using these cues, we are able to sense where the sound
is coming from.
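Although we do not need exact numbers here, the interaural time difference is often approximated with Woodworth's spherical-head formula, ITD = (a/c)(sin θ + θ), where a is the head radius, c the speed of sound and θ the azimuth of the source. The short sketch below uses an assumed head radius of 8.75 cm; it is an illustration only, not something the later chapters rely on.

#include <cmath>
#include <cstdio>

// Woodworth's approximation of the interaural time difference (ITD) for a
// source at azimuth theta (radians, 0 = straight ahead, positive to the right):
//   ITD ~= (a / c) * (sin(theta) + theta)
double interauralTimeDifference(double thetaRadians,
                                double headRadiusM = 0.0875,   // assumed average head radius
                                double speedOfSound = 343.0) { // m/s in air at ~20 C
    return (headRadiusM / speedOfSound) * (std::sin(thetaRadians) + thetaRadians);
}

int main() {
    // A source 45 degrees to the right reaches the right ear first; the left
    // ear hears it roughly 0.4 ms later.
    double theta = 45.0 * 3.14159265358979 / 180.0;
    std::printf("ITD at 45 deg: %.1f microseconds\n",
                interauralTimeDifference(theta) * 1.0e6);
    return 0;
}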
2.2.1
The Basics
The basic type of sound system is stereo sound using two loudspeakers, where
it is possible to localize a sound anywhere between the two speakers (Figure 2.2). The idea is pretty obvious:
To place a sound on the left, send its signal to the left loudspeaker;
to place it on the right, send its signal to the right loudspeaker;
if the same signal is sent to both speakers, a phantom source will appear to originate from a point midway between the two loudspeakers;
the position of the phantom source can also be shifted by exploiting
the precedence effect. If the sound on the left is delayed relative to the
sound on the right, the listener will localize the sound on the right side.
This principle can be further extended for multi-channel systems. That
is, a separate channel is created for each desired location (Figure 2.3). These
systems can produce impressive spatial results.
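As a rough illustration of these two mechanisms, the sketch below pans a mono signal between the two loudspeakers using a constant-power gain law (one common choice, assumed here) and an optional inter-channel delay for the precedence effect; the function and parameter names are ours and purely illustrative.

#include <cmath>
#include <cstddef>
#include <vector>

// Place a mono signal between two loudspeakers: a gain difference shifts the
// phantom source towards the louder speaker, and delaying one channel pulls
// the image towards the earlier channel (the precedence effect).
// pan runs from -1 (full left) to +1 (full right); delaySamples > 0 delays
// the left channel, delaySamples < 0 delays the right channel.
void panMono(const std::vector<float>& mono,
             float pan, int delaySamples,
             std::vector<float>& left, std::vector<float>& right) {
    const float angle = (pan + 1.0f) * 0.25f * 3.14159265f; // 0 .. pi/2
    const float gL = std::cos(angle);                       // constant-power law
    const float gR = std::sin(angle);
    const int dL = delaySamples > 0 ?  delaySamples : 0;
    const int dR = delaySamples < 0 ? -delaySamples : 0;
    const std::size_t n = mono.size();
    left.assign(n + dL, 0.0f);
    right.assign(n + dR, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        left[i + dL]  = gL * mono[i];
        right[i + dR] = gR * mono[i];
    }
}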
and change depending on the location of the sound source in relation to the
head. In other words, there is not a single transfer function, but a large set of
transfer functions, one for each direction the sound comes from. Finally, it
is important to know that there is no universal transfer function. Each person
has his own unique HRTF.
To measure the HRTF of an individual:
A loudspeaker is placed at a known location relative to the listener;
a known signal is played through that speaker and recorded using microphones at the position of the listener.
Comparing the recorded signal with the original signal gives the transfer function for that direction.
The listener cannot tell the difference between the sound from the source
and the same sound played back by a computer and filtered by the HRTF
corresponding to the original source's location. Such sound technology can
typically be built into head-mounted displays (HMDs) and is at present one
of the best methods by which we can re-create 3D interactive sound. Interesting work in this area has been carried out by Florida International University,
and further reading on the subject can be found in [6, 7].
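To illustrate the filtering step only, suppose the measured HRTF for one direction is stored as a pair of head-related impulse responses (time-domain filters). Convolving the mono source with each of them, as sketched below, yields the left-ear and right-ear signals; a practical system would work block by block in the frequency domain and interpolate between measured directions.

#include <cstddef>
#include <vector>

// Direct (time-domain) convolution: slow but shows the principle.
static std::vector<float> convolve(const std::vector<float>& x,
                                   const std::vector<float>& h) {
    if (x.empty() || h.empty()) return {};
    std::vector<float> y(x.size() + h.size() - 1, 0.0f);
    for (std::size_t i = 0; i < x.size(); ++i)
        for (std::size_t j = 0; j < h.size(); ++j)
            y[i + j] += x[i] * h[j];
    return y;
}

// Render a mono source at one fixed direction using the measured left and
// right head-related impulse responses for that direction.
void renderBinaural(const std::vector<float>& monoSource,
                    const std::vector<float>& hrirLeft,
                    const std::vector<float>& hrirRight,
                    std::vector<float>& outLeft,
                    std::vector<float>& outRight) {
    outLeft  = convolve(monoSource, hrirLeft);
    outRight = convolve(monoSource, hrirRight);
}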
2.3
The human sense of touch is more aptly described as our somatic sense, to
better reflect all the complex behavior actually involved. We already know
that touch has been classified as one of the five classic senses. However, when
we touch something or someone, we actually perceive more information than
can be gained from simple touch. For example, we can perceive pressure,
shape, texture, temperature, movements and so on. Thus, the simple term
touch involves many senses. However, we need to refine this complex topic,
and so we limit what we mean by our sense of touch to our haptic system.
Haptic, derived from the Greek word haptesthai, meaning to touch, emerged
in the 1990s as the term used to describe the sensation of communicating
with a computer through the sense of touch. Haptic sensory information
incorporates both tactile and kinesthetic information.
So what happens when we touch something? Initially, the sense of contact
is provided by the touch receptors in the skin, which also provide information
on the shape and texture of the object. When the hand applies more force,
kinesthetic information comes into play, providing details about the position
and motion of the hand and arm, and the forces acting on them, to give a
sense of total contact forces and even weight.
There are basically four kinds of sensory organs in the skin of the human hand that mediate the sense of touch. These are the Meissner's corpuscles, Pacinian corpuscles, Merkel's disks and Ruffini endings. The Meissner's
corpuscles are found throughout the skin, with a large concentration on
fingertips and palms. They are sensitive to touch, sharp edges and
vibrations, but cannot detect pain or pressure. The Pacinian corpuscles are
larger and fewer in number than the Meissner's corpuscles and are found in
deeper layers of the skin. They are sensitive to pressure and touch. The
Merkel's disks are found in clusters beneath ridges of the fingertips. They
are the most sensitive receptor to vibrations. Finally, the Ruffini endings are
slowly adapting receptors that can sense or measure pressure when the skin is
stretched.
The sense of kinesthesis is less well understood than the sense of touch,
although joint angle receptors in our fingers do include the Ruffini endings
and the Pacinian corpuscles. Thus, haptic interaction between a human
and a computer requires a special type of device that can convert human
movement into quantities that can be processed by the computer. The result
of the processing should then be converted back into meaningful forces or
pressure applied to the human as a result of her initial action. This is not an
easy thing to achieve.
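As a much-simplified illustration of what such a device's control loop must do, the sketch below computes the reaction force for a single haptic update when the virtual world is nothing more than a flat wall and the response is a spring proportional to penetration depth (a penalty method); the stiffness value is an arbitrary assumption, and real devices run such a loop at around 1 kHz.

#include <algorithm>

// One step of a (highly simplified) haptic servo loop: given the device
// position read from the hardware, compute the force the device should apply.
// The virtual object is a flat wall at y = 0; penetration below the wall is
// pushed back by a virtual spring.
struct Vec3 { double x, y, z; };

Vec3 wallReactionForce(const Vec3& devicePos,
                       double stiffness = 500.0 /* N/m, assumed */) {
    double penetration = std::max(0.0, -devicePos.y); // depth below the wall surface
    return Vec3{0.0, stiffness * penetration, 0.0};   // push straight back out
}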
2.3.1
The human haptic system has an important role to play in interaction with
VR. Unlike the visual and auditory systems, the haptic sense is capable of both
sensing and acting on the real and virtual environment and is an indispensable
part of many human activities.
As we have said, the human haptic system incorporates both tactile and
kinesthetic information, and this can be dealt with through the study of tactile
feedback and force feedback. Tactile feedback refers to devices that can interact
with the nerve endings in the skin to relate information about heat, pressure,
texture etc. Force feedback refers to devices which interact with muscles that
will then give the user the sensation of pressure being applied.
In order to provide the realism needed for effective and compelling applications, VR systems need to provide inputs to, and mirror the outputs of, the
haptic system. However, this is not easy to achieve, and it still has problems
that need to be resolved. The principal problem is that whilst we know a good deal about the
hand's sensory capabilities, they are not yet totally understood.
However, several attempts have been made at creating realistic haptic devices, and these operate under a number of different principles:
2.4
In VR, the sense of smell has been largely ignored as an input for the VR
participant, in spite of the fact that smell provides a rich source of information
for us [1], be it pleasant or unpleasant. Indeed, Cater [3] has conducted
research on how important the ambient smell of the physical environment
used to house the VR system is to enhancing the sense of presence within a
virtual environment. For example, a participant in the virtual environment
might be entering a virtual forest but still be able to smell the laboratory in
which he is sitting. Thus, we are going to take a quick look at some of the
devices used to enhance the sense of smell within VR. But before doing that,
it is useful to have some idea about how our olfactory system, which gives us
a sense of smell, actually works.
Odors are formed by the small molecules of gaseous chemicals that come
from many different things, such as flowers or food. These odors are inhaled
through the nasal cavity and diffuse through it until they reach its ceiling.
This is covered with a layer of microscopic hair-like nerve cells that respond
to the presence of these molecules by generating electrical impulses. The
impulses are transmitted into the olfactory bulb, which then forwards the
electrical activity to other parts of the olfactory system and the rest of the
central nervous system via the olfactory tract. This then senses the odor.
Devices have been developed to automatically detect various odors for use
within virtual environments. Some of these devices consist of sensors which
try to identify the smell. These sensors may be electrochemical, mechanical, thermal and/or magnetic, and can measure different characteristics of the
smell. Each characteristic can then be compared against a list of characteristics for predetermined smells, until a match can be found. However, this type
of sensing system becomes difficult when there are many odors to be sensed,
and the electrochemical sensors required are expensive [9].
These devices have been superseded by newer technologies such as the
air cannon, developed by the Advanced Telecommunications Research
Institute in Japan [12]. This device is capable of tracking an individual and
shooting an aroma directly into his nose. Again, one of the main drawbacks
of the system is that it can currently only synthesize a limited number of
odors.
Others types of devices range from those that simply inject odors into the
physical environment to those which transfer scented air to the nose via tubes,
which are usually attached to a head-mounted display. Thus, devices are now
available to include the sense of smell in our VR worlds. Researchers are
now concentrating their efforts into verifying whether this additional sense
does play an important role in VR and what applications it has. One of the
biggest areas being researched is in surgical simulation, with many people
believing that smell, in particular, is very important to medical practitioners
when making a diagnosis [11].
2.5
Interestingly, your sense of taste very much depends on your sense of smell.
When eating, odors from your food stimulate your olfactory system. If you
like what you are smelling then you should like what you are tasting! Taste,
or gustation, is again based on chemical sensing. There are five well-known
types of receptors in the tongue that can detect sweet, sour, salt and bitter.
The fifth receptor, called the Umami receptor, can detect the amino acid
2.6 Summary
But before we exert the effort required for such compelling VR experiences,
we need to be sure that the effort is justified. In the next chapter, we look
at different applications of VR and ask whether the technology has scientific
justification or whether it is just a plaything for the rich.
Bibliography
[1] W. Barfield and E. Danas. Comments on the Use of Olfactory Displays for
Virtual Environments. Presence 5:1 (1995) 109–121.
[2] M. Bouzit et al. The Rutgers Master II – New Design Force Feedback Glove.
IEEE/ASME Transactions on Mechatronics 7:2 (2002) 256–263.
[3] J. Cater. The Nose Have It! Letters to the Editor, Presence 1:4 (1992) 493–494.
[4] M. Finch et al. Surface Modification Tools in a Virtual Environment Interface
to a Scanning Probe Microscope. In Proceedings of the 1995 Symposium on
Interactive 3D Graphics, pp. 13–18. New York: ACM Press, 1995.
[5] H. Gurocak et al. Weight Sensation in Virtual Environments Using a Haptic
Device with Air Jets. Journal of Computing and Information Science in Engineering 3:2 (2003) 130–135.
[6] N. Gupta et al. Improved Localization of Virtual Sound by Spectral Modification of HRTFs to Simulate Protruding Pinnae. In Proceedings of the 6th
World Multiconference on Systemic, Cybernetics and Informatics, pp. III-291–III-296. Caracas, Venezuela: International Institute of Informatics and Systemics,
2002.
[7] N. Gupta et al. Modification of Head-Related Transfer Functions to Improve
the Spatial Definition of Virtual Sounds. In Proceedings of the 15th Florida
Conference on Recent Advances in Robotics, CD-ROM format. Miami, Florida:
Florida International University, 2002.
[8] H. Iwata et al. Food Simulator: A Haptic Interface for Biting. In Proceedings of the IEEE Virtual Reality 2004, pp. 51–57. Washington, D.C.: IEEE
Computer Society, 2004.
[9] P. Keller, R. Kouzes and L. Kangas. Transmission of Olfactory Information for
Telemedicine. In Interactive Technology and the New Paradigm for Healthcare,
edited by R. M. Satava et al., pp. 168–172. Amsterdam: IOS Press, 1995.
[10] P. Wellman et al. Mechanical Design and Control of a High-Bandwidth Shape
Memory Alloy Tactile Display. In Proceedings of the Fifth International Symposium of Experimental Robotics, pp. 56–66. London: Springer-Verlag, 1997.
[11] S. Spencer. Incorporating the Sense of Smell into Patient and Haptic Surgical
Simulators. IEEE Transactions on Information Technology in Biomedicine 10:1
(2006) 168–173.
[12] Y. Yanagida et al. Projection-Based Olfactory Display with Nose Tracking.
In Proceedings of the IEEE Virtual Reality 2004, pp. 43–50. Washington, D.C.:
IEEE Computer Society, 2004.
Applications and Implications of VR
According to engineering legend, this guy Fubini came up with a law (excuse
the lack of reference here, but it is a legend after all). It may not be a law in
the true sense of the word, but it certainly is an interesting observation that
goes something like this:
People initially use technology to do what they do nowonly faster;
then they gradually begin to use technology to do new things;
the new things change lifestyles and work styles;
the new lifestyles and work styles change society;
and eventually this changes technology.
Looking back on the incredible advances that have been made in technology,
we can see this simple law is breathtakingly accurate. Look at the Internet:
it has revolutionized the world of communications, like nothing has done
before it. The Internet began in the 1960s with an arbitrary idea of connecting multiple independent networks together in order to share scientific and
military research. As such, the early Internet was only used by scientists and
engineers. Today, nearly every home has an Internet connection, and we use
the Internet for our banking needs, shopping desires etc. In fact, we don't
even need to venture outside our homes anymore! However, this increased
usage of the Internet has necessitated the development of high-speed processors, larger bandwidth communications and more. Now try to tell us that
the Internet has not changed lifestyles and work styles, and as a result our future
technology needs.
So what about VR? Does it have the same potential to impact upon our
daily lives the way the Internet has? Whilst VR is still a rapidly evolving technology undergoing active research and development so that it can be made
more robust, reliable and user-friendly, many potential applications have been
identified for its use. Indeed, it is already used in some specialized areas. In
this chapter, we would like to outline some of the current and potential uses
of VR. And we ask the question: if this new technology has the potential
to change lifestyles and work styles, will it do so for the betterment of society?
To put it another way, the main idea of this chapter is to give a brief outline
of the current and potential future uses for VR so that you can begin to think
about how you could use this technology. Then, as you progress through the
remainder of the book, where we show you what hardware and software you
need to achieve various degrees of realism within numerous types of virtual
worlds, we hope you will be enthused to implement some of your own ideas
and start creating your own VR system.
3.1 Entertainment
Cinema, TV and computer games fall into this category. In fact, computer
game development has somewhat driven the development of VR technology.
In computer games, being fast and furious is the name of the game, with as
many twists and turns as can be accomplished in real time as possible. The
quality of images produced from games' rendering engines improves daily,
and the ingenuity of game developers ensures that this is a very active area
of research. And the market of course is huge. In 2000, some 60 percent
of Americans had a PC connected to the Internet, and 20 percent had a
gaming console. Gaming companies now make more money than the movie
industry ($6.1 billion annually) [5], and of course a large portion of this
profit is poured into research and development to keep pushing forward the
boundaries of gaming.
This is good news, because any developments in computer gaming are
almost certainly replicated in VR technology. Indeed, it is sometimes difficult to understand the difference between computer gaming and VR. In
fact, many researchers are customizing existing computer games to create VR
worlds. Researchers at Quebec University suggest that computer games might
be a cheap and easy-to-use form of VR for the treatment for phobias. The
main reason for this is that the games provide highly realistic graphics and can
3.2 Visualization
It isn't just the world of entertainment that is driving VR technology; psychologists, medical researchers and manufacturing industries are all showing
a keen interest. Scientific visualization is the process of using VR technology
(and in particular computer graphics) to illustrate experimental or theoretical
data, with the aim of bringing into focus trends, anomalies or special features
that might otherwise go unnoticed if they were presented in simple tabular
form or by lists of numbers. In this category, one might include medical
imaging, presentations of weather or economic forecasts and interpretations
of physical phenomena.
3.3 Training
Training is one of the most rapidly growing application areas of VR. We all
know about the virtual training simulators for pilots. These simulators rely
on visual and haptic feedback to simulate the feeling of flying whilst being
seated in a closed mechanical device. These were among the first simulators to be developed, but now we have simulators for firefighters, surgeons,
space mission controls and so on. The appeal of VR for training is that it
can provide training nearly equal to practicing with real-life systems, but at a
much reduced cost and in a safer environment. In fact, VR is the ideal training medium for performing tasks in dangerous or hazardous environments,
because the trainee can be exposed to life-threatening training scenarios
under a safely controlled computer-generated environment. If the trainee
makes a mistake then the simulation can be stopped and the trainer will have
the chance to explain to the trainee exactly what went wrong. Then the
trainee will have the opportunity to begin the task again. This allows the
trainee to develop his skills before being exposed to dangerous real-life environments.
Many areas of training can take advantage of VR. For example:
At the University of Sheffield (UK), researchers are investigating a flexible solution for VR to be used for training police officers how to deal
with traffic accidents [14];
the Simulation and Synthetics Laboratory Environment at Cranfield
University (UK) has developed a number of VR training simulators
[12]. These include surgical procedure training, flight deck officer
training and parachute training;
Fifth Dimension Technologies has developed a range of driving simulators for cars, trucks, forklifts, surface and underground mining vehicles. These simulators range from a single computer screen with a
game-type steering console to a top-of-the-range system which consists
of a vehicle with a field of view of 360° mounted on a motion base.
The explosion of VR for training purposes can be well justified. It is an
ideal tool for training people how to use expensive equipment. This not only
safeguards the equipment, but also means that it will not have to be taken out
of the production line for training purposes. VR is also cost-effective when
the equipment has a high running cost that could not be justified for training
purposes only.
Of course the military is also a huge user of VR technology, not just for
training pilots but also for training personnel in all forms of combat simulations. Despite this, it is in fact the entertainment industry that is driving
the technological advances needed for military and commercial VR systems.
In many ways, the entertainment industry has grown far beyond its military counterpart in influence, capabilities and investments in this area. For
example, Microsoft alone expected to spend $3.8 billion on research and development in 2001, compared to the US Army's total science and technology
budget of $1.4 billion [5]. Indeed, the computer-game industry has considerable expertise in games with military content (for example, war games,
simulations and shooter games) [5] and is in fact driving the next level of
war-game simulations. Macedonia highlights the development of these simulations [5]. He discusses the US Army's training needs and goals, and how the
Army came to realize the best way to achieve these goals was to work hand-in-hand with academia and Hollywood. The use of VR for military purposes
is self-explanatory. It allows soldiers to experience battlefields without endangering their own lives. Obviously, mistakes made in a virtual battlefield are
less permanent and costly than they would be in reality.
And of course, let's not forget about the use of VR in medical training.
We gave an overview of the use of VR for medical visualization. This overlaps
somewhat with training applications, for example, keyhole surgery etc. The
benefits of VR training here again are self-evident.
And we could go on and on. Essentially, anything that people require
training for can be implemented in a virtual environment. It is only the
imagination which limits the applications in the area of training.
3.4 Education
3.5 Other Applications
The list is almost endless. An easier question to pose is what can't VR technology be used for? Looking at the conference paper titles of the Virtual
Reality Conference 1995, we can see a list of different areas of research that
VR technology was being developed for more than 10 years ago.
Theoretical and practical issues for the use of virtual reality in the cognitive rehabilitation of persons with acquired brain injuries.
Using virtual and augmented reality to control an assistive mobile robot.
Remote therapy for people with aphasia.
Ethical pathways to virtual learning.
Virtual reality in the assessment and treatment of body image.
3.6 Distributed VR
One of the hottest topics in VR research is distributed VR. This means not
only can we run a simulated world on one computer, but we can run that
same world on several connected computers or hosts. So in theory, people
all over the world can connect and communicate within the same virtual
environment. In terms of communication, the crudest method available is
to communicate with other users by simple text messages. At the high-tech
end, people can be represented by avatars and thus people can communicate
in a more meaningful way with other people's avatars.
Distributed VR embodies all the applications we have already considered,
because it is able to take them one step further. Take for example VR and
simulations. The biggest advantage that VR has over typical simulations is
that VR allows multiple users. With distributed VR, we can have multiple
users connected up at different sites globally. Thus we can have distributed
VR training facilities, distributed entertainment (to a degree, we have had this
for some time if we think about the online computer-gaming industry, where
multiple users connect to the same game and compete against each other) and
distributed visualization (again, this is really an extension of teleconferencing,
but here people can enter into the meaningful simulations from multiple sites
throughout the world).
Of course, before distributed VR really takes off, there are a number of
problems that must be resolved, including:
Compatibility. This refers to the fact that people may be using different
hardware and software at their host sites.
Limited bandwidth. If we require global connections then we have to
rely on the Internet, and as such the realism of the VR connection will
depend on the bandwidth available at the host sites.
Latency. Again, with global connections between the host sites, we have
to rely on Internet communication protocols. As such, it is difficult
to ensure the real-time delivery of information simultaneously to each
host's site.
It won't be long before these bottlenecks are resolved and then distributed VR
will know no bounds.
3.7 The Implications of VR
Weiss [15] tells us that whilst there has been considerable speculation and
enthusiasm among possible users of VR technology regarding its potential,
much of it is still blue-sky thinking rather than actual applications. So, as we
have seen, VR has considerable potential for many applications, but it may
not be appropriate or desirable in some cases. There are a number of reasons
for this. Primarily, VR is currently still very expensive as a development tool
and it may not be necessary. For designers, CAD (computer-aided design)
may deliver everything a designer needs without the complications of VR.
In addition, limited research has been done to prove that there is effective
transfer of skills developed in the virtual environment to the real world. It is
this issue that we would like to discuss in more detail.
Critical to using any VR simulator for training is being able to assess
how the skills learnt in the virtual environment will be transferred to the real
world. Rose tells us that within the VR training literature, there is a wealth of
anecdotal evidence that transfer does occur. However, Rose also insists that
there have been relatively few attempts to investigate empirically the virtual to
real world transfer process with regards to what sort of training shows transfer,
in what conditions, to what extent, and how robust that transfer is [10].
Take for example the case where VR is used to desensitize people to fears
and phobias. The Virtual Reality Medical Center in California currently uses
VR exposure therapy in combination with physiological monitoring and feedback to treat panic and anxiety disorders. Social phobia is the most common
form of anxiety disorder, and the single greatest fear that seems to exist worldwide is that of public speaking. We can overcome these fears by practicing
public speaking, and by utilizing a number of calming and breathing techniques. VR is used in many instances to provide simulations for practicing
public speaking, and in essence the user becomes less sensitive to his fear or
phobia. This lessening of anxiety can be an important asset in enabling people
to maintain their cool under duress, but it may also lead to a loss of respect
for a real-life danger, particularly where the hazard is experienced in a game
format [15]. So if this suggests that practicing or training for certain situations invariably leaves us feeling less anxious about the real event then let's
look at the bigger picture. Some researchers have been trying to determine
the effectiveness of transferring the idea of risk through VR [6]. For example,
Mitchell [6] believes that if the simulation is not entirely credible then the
user departs from her VR experience with a lower perception of the hazards
involved. This can be detrimental to the soldier training in a war game simulation, the surgeon practicing her surgery skills and so on. That is, they have
become desensitized to the real-world dangers.
So perhaps what we can say here is that VR is useful for some applications
but not for all. For example, Stanney suggests that we need to determine the
tasks that are most suitable for users to perform in VR. Of course, the implication here is that not all tasks are suitable for VR. Indeed, Stanney [13] has
suggested that an understanding of the human factor issues can provide a systematic basis by which to direct future VR research. Because VR is all about
trying to trick the human user into believing he is in a different environment,
the best and indeed only real way to assess the efficiency of VR is to determine
how much of an impact VR has on the user. Once we can determine this, we
can improve the VR experience.
They suggested the following key human factors which determine the
effectiveness of VR.
freedom to commit rape and murder within VR. Obviously, this is an ethical
rather than technical issue. It is technically possible to construct VR in such
a way that almost every possibility of the users imagination can be fulfilled.
It is also possible for designers to place arbitrary limits on what is possible
within a particular VR. That is, the VR system can be designed and built in
such a way to prevent people from carrying out immoral acts. Whitby goes
on to argue that there are four very real reasons as to why restrictions should
be placed in VR. These are [16]:
1. People acting out scenarios in VR might then do it for real. This may not
be as crazy as it first seems. We all know that crime is on the rise in the
western world, and for some time people have been linking this rise to
the rise in crime shown on popular TV.
2. Some things are not acceptable even in private. This point indicates that
if a person commits a morally unacceptable act in VR, because it
doesn't affect anyone, can it really be seen as unethical? Of course,
the answer is yes, because every person has an ethical duty to themselves.
3. People may end up preferring the virtual world to the real. Again we need
only remember the world of the Internet and cyber dating to know that
this argument could well have substance.
4. The designers of VR can signal social approval or disapproval. Essentially,
designers of VR have the ability to reward ethical behavior and prevent
unethical behavior. For example, look at some computer games on the
market. They require their users to commit murder in order to advance
to the next level of the game. Here they are obviously being rewarded
for unethical behavior. And again, some psychologists would question
the impact this has on our youth.
Well, all this looks a bit depressing for VR! We won't delve any deeper into
the ethical issues of the technology; we just wanted you to know that they
do exist.
But on to some more bad news for VR. Let's not forget that the use
of VR technology may also have some health concerns for the individuals
involved. For example, performing tasks in a virtual environment may require
increased concentration compared to performing similar tasks in the natural
world [3]. This may affect heart rate and blood pressure, for example, which
may adversely affect some people and not others. And it is also known that
certain values of lag time in a visual display can cause disorientation, which
is now termed vision-induced motion sickness (VIMS). (Lag time in this context refers to the time it takes for the visual display to be updated in order to reflect the movement of the VR user.) In fact, most people
cannot use VR systems for more than 15 minutes without feeling the effects
of VIMS. But again, as graphic processing units become more powerful, lag
time reduces, and as such VIMS will become a less problematic phenomenon
of using VR.
There you have it! The negative effects of VR encapsulated in a few
paragraphs. Of course, we should mention that if the technology is used
in a responsible manner then the technology really does have the means to
improve people's lifestyles and work styles. Just think of the automobile. So
it's not all bad!
3.8 Summary
As you have seen, there is enormous potential for VR technology to have an impact on numerous industries, educational establishments and other applications. Examples include the use of VR in desensitization training for patients with phobias, in treating children with autism, in training people with learning difficulties and in the rehabilitation of patients. It is impossible to cover all the potential applications of VR, and we have not tried. What we have done is try to give you a broad sample of applications currently being researched and developed, to give you an idea of the breadth and flexibility of VR.
However, it is also worth noting that most potential applications of VR are still very much in the development stage. A wide and varied range of tests and development scenarios must be devised and implemented to verify whether VR is useful for an application, and if so to what extent and whether it justifies the cost. And then of course we need to enter the arena of human effects and physiological impact. This requires a set of skills quite different from those possessed by the engineer or computer scientist who can design and build the VR scenarios. Thus we enter the realm of cross-disciplinary work, which adds a whole new set of complications to the decision of whether or not to use VR.
Added to all that, we must not forget that the equipment required for our VR scenarios is still not as highly developed as we would like it to be. And of course VR is just like any other technology when it comes to market.
¹Lag time in this context refers to the time it takes for the visual display to be updated in order to reflect the movement of the VR user.
Bibliography
[1] C. Cruz-Neira et al. VIBE: A Virtual Biomolecular Environment for Interactive Molecular Modeling. Journal of Computers and Chemistry 20:4 (1996)
469–477.
[2] B. Delaney. Visualization in Urban Planning: They Didnt Build LA in a Day.
IEEE Computer Graphics and Applications 20:3 (2000) 10–16.
[3] R. C. Eberhart and P. N. Kzakevich. Determining Physiological Effects of
Using VR Equipment. In Proceedings of the Conference on Virtual Reality and
Persons with Disabilities, pp. 47–49. Northridge, CA: Center on Disabilities,
1993.
[4] D. Fallman, A. Backman and K. Holmund. VR in Education: An Introduction to Multisensory Constructivist Learning Environments. In Proceedings
of Conference on University Pedagogy. Available online (http://www.informatik.
umu.se/dfallman/resources/Fallman VRIE.rtf), 1999.
[5] M. Macedonia. Entertainment Technology and Virtual Environments for
Military Training and Education. In Forum Futures 2001. Available online
(http://www.educause.edu/ir/library/pdf/ffp0107s.pdf). Cambridge, MA: Forum for the Future of Higher Education, 2001.
[6] J. T. Mitchell. Can Hazard Risk be Communicated through a Virtual Experience? Disasters 21:3 (1997) 258–266.
[7] S. Pieper et al. Surgical Simulation: From Computer-Aided Design to
Computer-Aided Surgery. In Proceedings of Medicine Meets Virtual Reality.
Amsterdam: IOS Press, 1992.
[8] G. Robillard et al. Anxiety and Presence during VR Immersion: A Comparative Study of the Reactions of Phobic and Non-Phobic Participants in Therapeutic Virtual Environments Derived from Computer Games. CyberPsychology
& Behavior 6:5 (2003) 467–476.
[9] B. Rothlein. Crashes Which Don't Leave Any Dents. BMW Magazine (2003) 84–87. Available online (http://www.sgi.com/pdfs/3374.pdf).
[10] F. D. Rose et al. Transfer of Training from Virtual to Real Environments.
In Proceedings of the 2nd Euro. Conf. on Disability, Virtual Reality and Assoc.
Technology, pp. 69–75. Reading, UK: ECDVRAT and University of Reading,
1998.
[11] S. Russell et al. Three-Dimensional CT Virtual Endoscopy in the Detection
of Simulated Tumors in a Novel Phantom Bladder and Ureter Model. Journal
of Endourology 19:2 (2005) 188–192.
[12] V. Sastry et al. A Virtual Environment for Naval Flight Deck Operations Training. In NATO Research and Technology Organization Meeting Proceedings, MP058, 2001.
[13] K. Stanney et al. Human Factors Issues in Virtual Environments: A Review of
the Literature. Presence 7:4 (1998) 327–351.
[14] B. Subaih et al. A Collaborative Virtual Training Architecture for Investigating
the Aftermath of Vehicle Accidents. In Proceedings of Middle Eastern Symposium on Simulation and Modeling, 2004.
[15] P. Weiss et al. Virtual Reality Applications to Work. WORK 11:3 (1998)
277–293.
[16] B. R. Whitby. The Virtual Sky Is not the Limit: Ethics in Virtual Reality.
Intelligent Tutoring Media 4:1 (1993) 23–28.
[17] B. Ulicny. Crowdbrush: Interactive Authoring of Real-Time Crowd Scenes.
In Proc. ACM SIGGRAPH/Eurographics Symposium on Computer Animation,
pp. 243–252. Aire-la-Ville, Switzerland: Eurographics Association, 2004.
Building a Practical VR System
The goal for any interactive VR environment is to provide devices which allow the user to manipulate virtual objects in the same way as real ones. In most tasks performed by a computer, the conventional keyboard and 2D mouse have proven to be an adequate form of input. However, the 3D dexterity that we normally expect when using our hands (along with their sense of touch) requires capabilities which go beyond what can be achieved using these conventional forms of input.
Thus for interaction in our VR world, we need an input device which can
sense the position of our hands at any location in the 3D real world. Sometimes, we might also like to sense the result of any action we undertake in the
virtual world. In this case, we need to use a combined 3D input and output
device. This is more commonly referred to as a haptic or force-feedback device. For these input and/or output devices to work, it is essential to be able
to connect the real and virtual worlds. That is, not only do we need to be able
to acquire input from basic devices such as mice and keyboards (that actively
send 2D data to VR software), but for more advanced VR systems, we must
also consider how a user's movement in the real world can be sensed and acted
upon in the virtual world to allow for real-time immersion. That is, we need
3D user interfaces which can sense things in the real world and acquire 3D
information on position and orientation of the user. The hardware associated
with 3D user interfaces is such a vast subject that it forms a major part of the
book by Bowman et al. [6], and this would make worthwhile reading if you
are contemplating creating your own VR system. The book comprehensively
examines the whole subject of 3D user interfaces. In particular, Chapter 4
gives a wide-ranging review of 3D input hardware including tracking devices,
Chapter 6 introduces the reader to some novel solutions for interfacing to the
virtual traveling experience and Chapter 3 describes the range of devices and
technologies used to present the virtual world to the real visitor.
As we have seen in Chapter 2, for a VR system to have any added value, it must interface with as many of the senses as possible: principally vision (including stereopsis and high dynamic range lighting) and touch (including our sense of pushing and pulling, i.e., force), and, in a more abstract way, our sense of being there. So whether you are building a system which will have your user working at a desktop, immersed in a computer-assisted virtual environment (cave¹) or wearing a head-mounted display, there is no escaping the fact that the visual display lies at the heart of the system.
¹The term cave has achieved some popular usage to describe a virtual environment where the walls of a room or structure that encloses or partially encloses the VR system user also act as a display surface.
4.1 Technology of Visualization
The central component of any VR system, whether it be desktop or immersive, is the display. The display is driven by the graphics adapter. Graphics adapters are high-powered computing engines (discussed in detail in Chapter 13) which usually provide at least two outputs. These outputs can either be connected to some form of monitor (or multiple monitors) for desktop VR or to a projection system (again, with one or more projectors) for an immersive VR system.
4.1.1 Desktop VR
There are three types of desktop display technology in common use today: the cathode ray tube (CRT), the liquid crystal display (LCD) and the plasma panel. Which one is best suited to a particular application will depend on a number of factors, e.g., brightness, size, resolution, field of view, refresh rate and suitability for stereopsis, which is discussed in detail in Chapter 10. We will briefly look at the construction of each of these display technologies and how that influences their use. In addition, an interesting experimental aspect of desktop display involves the possible use of projected images formed on solid shapes that sit in front of the user. We will illustrate this in the section on desktop projection, but first we turn to the conventional displays.
Looking at Figure 4.1, one sees that the CRT is based on an evacuated glass container (Figure 4.1(a)) with a wire coil at one end and a fine metal grid at the other. When the coil is heated, a cloud of charged particles (electrons) is generated. Applying a voltage of more than 25 kV between the coil (the cathode) and the metal grid (the anode) accelerates a beam of electrons from cathode to anode. When they hit the screen at the anode, a layer of phosphorescent material emits light. Magnetic fields induced by coils around the tube bend the beam of electrons. With appropriate driving currents in the field-generating coils, the electron beam scans across and down the screen. Changing the voltage of a grid in the electron path allows fewer electrons in the beam to reach the phosphor, and so less light is generated. Figure 4.1(b) depicts the color display's use of three phosphors to emit red, green and blue (RGB) light. Three electron guns excite the color phosphors to emit a mix of RGB light. A shadow mask ensures that only electrons from the R, G or B gun excite the corresponding R, G or B phosphors. In Figure 4.1(c), the analog signal that drives the CRT has special pulses which synchronize the line and frame scanning circuitry so that the position of the beam on the screen matches the pixel value being read from the graphics adapter's video memory.
Figure 4.1. The CRT: (a) layout, (b) screen and (c) driver waveform.
LCDs. Liquid crystal displays are fast becoming the norm for desktop use, as they are very compact. But, with the exception of a few specialized devices, they cannot exceed refresh rates of 75 Hz. This
makes most LCDs unsuitable for stereopsis. The best technology currently available for the manufacture of desktop and laptop LCD panels is the back-lit, thin-film-transistor, active-matrix, twisted-nematic liquid crystal described in Figure 4.2. The main problem with the LCD film is that the color is only accurately represented when the panel is viewed straight on. The further away you are from a perpendicular viewing angle, the more the color will appear washed out. However, now that panels can be made in quite large sizes and their cost is dropping rapidly, several panels can be brought into very close proximity to give a good approximation of a large wide-screen desktop display.
As illustrated in Figure 4.2, the LCD screen is based on a matrix (i, j) of small cells containing a liquid crystal film. When a voltage is applied across a cell, a commensurate change in transparency lets light shine through. Color is achieved by using groups of three cells and a filter for either red, green or blue light. Each cell corresponds to image pixel (i, j) and is driven by a transistor switch to address the cells as shown in Figure 4.2(a). A voltage to give the required transparency in cell (i, j) is set on column data line i. By applying another voltage to row address line j, the transistor gate opens and the column signal is applied to the LCD cell (i, j). The structure of the LCD cell is depicted in Figure 4.2(b). The back and front surfaces are coated with polarizing layers whose axes are offset by 90°. This would normally block out the light. However, when no voltage is applied across the cell, the liquid crystals tend to line up with grooves in the glass that are cut parallel to the polarizing layers; the resulting twist in the crystal film rotates the polarization of the light so that it can pass through both polarizers, and applying a voltage across the cell disturbs this alignment and so controls how much light gets through.
Figure 4.2. The operation of an LCD screen: (a) electrical layout, (b) physical layout.
Figure 4.3. The structure of a plasma display panel and cell cross-section.
When the voltage applied across cell (i, j) is high enough, the gas will ionize and create a plasma. Plasma is a highly energetic state of matter that gives off ultraviolet (UV) photons as it returns to its base state. The UV photons excite a phosphor coating on the cell walls to emit visible light. Each cell is coated with a phosphor that emits either red, green or blue light. In this way, a color image can be generated. The plasma-generating reaction is initiated many times per second in each cell. Different visible light intensities are generated in a cell by changing the number of times per second that the plasma is excited.
Desktop Projection
Projection is not normally something one thinks about when considering
desktop VR. But if you want something to really stand out then it should
be. In situations where binocular stereopsis does not provide a sufficient simulation of depth perception, or you really do want to look around the back
of something, projecting an image of a virtual object onto an approximately
shaped surface or even onto panes of coated glass may provide an alternative.
Some work on this has been done under the topic of image-based illumination [15]. The basic idea is illustrated in Figure 4.4, where a model (possibly
clay/wood/plastic) of an approximate shape is placed on a tabletop and illuminated from three sides by projectors with appropriate images, possibly
animated sequences of images. This shows surface detail on the blank shape
and appears very realistic. Architectural models are a good candidate for this
type of display, as are fragile or valuable museum exhibits, which can be realistically displayed on open public view in this manner. Some distortion may
be introduced in the projection, but that can be corrected as illustrated in
Figure 4.10.
If you are projecting onto a surface, there are a couple of issues that need
to be addressed:
The number of projectors it takes to provide the illumination. When projecting onto a surface that is primarily convex, good coverage will be
achieved with the use of three projectors.
Images which are being projected may need to be distorted. If the surface of projection is not flat and perpendicular to the direction of projection then any projected image will be distorted. The distortion may be relatively simple, such as the keystone effect, or it may be highly nonlinear. Whether to go to a lot of trouble to try to undistort the image is a difficult question to answer (a minimal pre-warping sketch is given after this list). We will examine distortion in detail shortly, but for cases where different shapes are being projected onto a generic model (for example, different textures and colors of fabric projected onto a white cloth), imperfections in alignment may not be so important.
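To make the idea of pre-distorting an image more concrete, the short sketch below applies an inverse plane homography to every output pixel, which is one standard way of correcting keystone-type distortion for a flat but obliquely placed screen. It is only a sketch under our own simplifying assumptions: a grayscale image, nearest-neighbour sampling and a 3 × 3 matrix Hinv assumed to have been obtained elsewhere (for example, from four measured point correspondences); it is not the method used by any particular projector.

// Minimal keystone pre-warp sketch (grayscale, nearest-neighbour sampling).
// Hinv is assumed to be the inverse of the 3x3 homography mapping the
// undistorted source image onto the projector's image plane.
struct Vec3 { double x, y, z; };

static Vec3 applyH(const double H[9], double u, double v) {
    return { H[0]*u + H[1]*v + H[2],
             H[3]*u + H[4]*v + H[5],
             H[6]*u + H[7]*v + H[8] };
}

void prewarp(const unsigned char* src, int sw, int sh,
             unsigned char* dst, int dw, int dh, const double Hinv[9]) {
    for (int v = 0; v < dh; ++v) {
        for (int u = 0; u < dw; ++u) {
            Vec3 q = applyH(Hinv, u, v);           // map projector pixel back into the source
            int x = (int)(q.x / q.z + 0.5);
            int y = (int)(q.y / q.z + 0.5);
            dst[v*dw + u] = (x >= 0 && x < sw && y >= 0 && y < sh)
                          ? src[y*sw + x] : 0;     // black outside the source image
        }
    }
}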
Figure 4.4. Projection of an image of a wine bottle onto a white cylinder gives the illusion of a real wine bottle being present in the room.
4.1.2 Immersive VR
Two display configurations dominate immersive VR: the cave and the head-mounted display (HMD). However, neither is a perfect solution.
Head-Mounted Displays
An HMD is essentially a device that a person can wear on her head in order
to have images or video information directly displayed in front of her eyes.
As such, HMDs are useful for simulating virtual environments. A typical
HMD consists of two miniature display screens (positioned in front of the
user's eyes) and an optical system that transmits the images from the screens to the eyes, thereby presenting a stereo view of a virtual world. A motion tracker can also continuously measure the position and orientation of the user's head in order to allow the image-generating computer to adjust the scene representation to match the user's direction of view. As a result, the viewer can look
around and walk through the surrounding virtual environment. To allow this
to happen, the frame refresh rate must be high enough to allow our eyes to
blend together the individual frames into the illusion of motion and limit the
sense of latency between movements of the head and body and regeneration
of the scene.
The basic parameters that are used to describe the performance of an
HMD are:
Field of view. Field of view is defined as the angular size of the image as
seen by the user. It is usually defined by the angular size of the diagonal
of the image.
Resolution. The quality of the image is determined by the quality of the optics and the optical design of the HMD, but also by the resolution of the display. The relation between the number of pixels on the display and the size of the FOV will determine how grainy the image appears. People with 20/20 vision are able to resolve detail down to 1 minute of arc. The average angle that a pixel subtends can be calculated by dividing the FOV by the number of pixels. For example, for a display with 1280 pixels in the horizontal plane, a horizontal FOV of 21.3° will give 1 pixel per minute of arc (a worked version of this calculation is given after this list). If the FOV is increased beyond this, the image will appear grainy.
Luminance. Luminance, or how bright the image appears, is very important for semi-transparent HMD systems, which are now being used to overlay virtual data onto the user's view of the outside world. In this case, it is important that the data is bright enough to be seen against the light from the ambient scene.
Eye relief and exit pupil. Eye relief is the distance of the eye from the
nearest component of the HMD. This is shown in Figure 4.5. The size
of the eye relief is often dependent on whether the user is required to
keep his eyeglasses on, as this requires extra space between the HMD
and the eye. An eye relief of 25 mm is usually accepted to be the
minimum for use with eyeglasses. If the HMD is focusable such that
eyeglasses do not need to be worn then the eye relief can be less. The
exit pupil is the area where the eye can be placed in order to see the
full display. If the eye is outside the exit pupil then the full display will
not be visible. Generally, the greater the eye relief, the smaller the exit
pupil will be.
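As a worked version of the calculation in the Resolution item above (re-deriving the same figures), the angle subtended by one pixel is simply the field of view divided by the number of pixels:

pixel angle = FOV / number of pixels = (21.3° × 60 arcmin per degree) / 1280 ≈ 1 arcmin per pixel,

which matches the one-minute-of-arc resolving power of 20/20 vision. Widening the FOV at the same pixel count spreads each pixel over more than one minute of arc, and the image begins to look grainy.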
An HMD can be for one or two eyes. A one-eye system is known as a
monocular system, while a two-eye system is known as biocular or binocular:
Monocular display. A monocular system uses one display for only one
eye. This is the simplest type of system to build, as there is no requirement to match what each eye sees. This type of system generally has a
limited FOV, as it is not comfortable for one eye to be scanning over a
wide FOV while the other eye is not.
Biocular systems. A biocular system is one in which both eyes see the
same image. This can be achieved using two displays with one set of
electronics to show the same image on both displays.
Binocular systems. A binocular system is one in which each eye sees a
different image. For this device, two displays and two sets of electronics
are required. A binocular system is required for stereoscopic viewing.
Both biocular and binocular systems are more complicated than monocular systems, as they require the displays to be separated by a specific distance known as the interpupillary distance (IPD, the distance between the eyes). Binocular systems also require the images to be overlapped in order to produce a stereoscopic view. Today, binocular systems are more common.
Figure 4.5. Relationship between the HMD eye relief and exit pupil size.
At first sight, the HMD seems to offer the greatest promise, because it
should be cheaper, more compact, very transportable and allows every user
to be enveloped in her own little world. However, unless the HMD offers a see-through capability, the users can't even see their own hands. Even see-through HMDs do not provide a complete solution, because as soon as you can see through the virtual elements to see your hands, you will also be able to see other elements of the real environment such as the floor or walls. That may not be what you want; you might want the HMD wearer to believe she is standing in the middle of a football stadium. Most HMDs are also quite
bulky, and because wireless links between the HMD and graphics processor
are rare, the user is also constrained to within a few meters of the computer
controlling the environment.
In the longer term, we have no doubt that solutions to the technical problems of miniaturization and realistic display will come to the HMD, just as accurate position sensing and photo-quality real-time rendering are with us now. This will not mean the end of the cave, for the reason alluded to above, i.e., the need to see some real-world elements (your hands, other people etc.). In the mid-term future, we have no doubt that a mix of see-through HMDs and high-brightness stereoscopic caves will open the door to amazing virtual/augmented/synthetic realities; but as of now, we believe the cave offers the better answer.
that shower cubicle. If the cave is too small, one might as well not bother
even starting to build it. Of course, one might be able to call on a little mechanical engineering and build a moving floor. Possibly something as simple
as a treadmill might do the trick. Unfortunately, a treadmill is an individual
thing, and making a cave for individual use somewhat negates the ability for
group work, which is one of the major advantages of cave applications.
Figure 4.6 illustrates some of the key points of cave design. First amongst
them is whether you use front or back projection or build the cave from video
walls. An immersive cave can take on almost any shape. In Figure 4.6 (left),
a back projected cubic system is depicted. Most low-cost projectors have a
built-in keystone correction (as seen in the lower diagram) which allows them
to be positioned near the top of the wall. In a front-projection system, such as
the cylindrical configuration in Figure 4.6(center), the keystone effect helps
to minimize the problem of shadow casting. A cave does not have to be either
cubic or cylindrical in shape. For exhibitions or special presentations, several
projectors can be linked, as in Figure 4.6 (right) to present continuous images
over a complex shape. Several examples of this type of display may be found
in [16].
Front projection saves on space, and with most projectors able to throw
an asymmetrical keystone projection, say from a point at roof level,
anyone working in the cave will only start to cast a shadow on the
display as they approach the screen.
Back projection does not suffer from the shadow-casting problem, but the walls of the cave have to be thin enough and made of appropriate materials to let the light through into the working volume. There may also be a problem with how the walls of the cave are constructed, because the supports may cast shadows.
The video wall offers a lot of flexibility because separate tiles in the wall
can carry individual displays if required. LCD and Plasma displays are
getting bigger and have very small borders, so a good approximation to
a whole wall in a cave could be made from as few as four displays. A
system like this would have the major advantages of being very bright,
occupying little space and not suffering from the shadow problem. Disadvantages include the cost and need for cooling.
The second issue involved with a cave design is the shape of the walls;
there are pros and cons with this too:
A cuboid shape is probably the most efficient use of space for a cave
system, and it is relatively easy to form the necessary projections. This
is especially true if the roof and floor of the room are ignored.
A cylindrical shape for the walls of a cave has the advantage of being
a seamless structure, and so any apparent corner where the walls meet
is avoided. Unfortunately, a conventional projector cannot present an
undistorted image over a curved screen. This is most noticeable when
several projector images must be stitched together to make up the full
cave. We will discuss in detail how to build and configure a semicircular cave, which has the advantage of still giving a sense of immersion even though front projection and multiple images are stitched
together.
Spherical cave systems have the potential to give the most realistic projections. Most often, they are implemented for use by a single person and have an innovative projection system design that requires only one projector.
A cave does not just have to consist of a cubic or cylindrical working volume. Multiple projectors can be set up to display images on almost any working environment. Figure 4.6 (right) depicts an environment that might typically be used in exhibition halls and for trade shows.
Figure 4.7. LCD projectors typically use three LCD cell matrices. Dichroic mirrors
separate white light into primary colors, which pass through each of the LCDs. LCDs
in projectors use active matrix cells, but they must be much smaller than the cells in
a desktop LCD monitor. (w indicates white light, r red light component etc.)
DLP projectors have high refresh rates and deliver a high resolution. At the core of a DLP projector is the Texas Instruments digital micromirror device (DMD), a MEMS (microelectromechanical systems) chip. Figures 4.8 and 4.9 illustrate the DMD and how DLP projectors work. The word mirror in DMD gives a vital clue to the idea behind the DLP. The DMD chip is an array of microscopic mirrors, one mirror for each pixel in the image. Each microscopic mirror is attached to the chip's substrate by a small hinge, and so it can be twisted into two positions. In one of these positions, it is angled in such a way that light from the projector's lamp is reflected through the projector's lens and onto the screen. In the other position, the light is reflected away from the lens and no light then appears on the screen. Thus, we have a binary state for the projection of each pixel. By flicking the micromirror back and forth hundreds of times a second, intermediate shades of gray can be simulated. Color is easily added by using filters and three DMDs for the RGB primaries. For a detailed description of the structure of the DMD and a comparison of its performance against LCD and other display technology, see Hornbeck [9].
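As a simple illustration of how a binary mirror can fake a gray scale, the fragment below schedules bit-weighted on slots within one frame for an 8-bit intensity. The 60 Hz frame period and the particular bit-plane ordering are assumptions of ours; real DLP controllers use considerably more elaborate sequences, but the principle of time-weighted switching is the same.

// Bit-weighted pulse-width modulation of a single DMD mirror.
#include <cstdio>

int main() {
    const double frameTime = 1.0 / 60.0;   // assumed frame period in seconds
    unsigned char gray = 180;              // requested intensity, 0..255
    double onTime = 0.0;
    for (int bit = 0; bit < 8; ++bit) {
        double slot = frameTime * (1 << bit) / 255.0;  // slot length weighted by bit value
        if (gray & (1 << bit))
            onTime += slot;                // mirror is flipped "on" during this slot
    }
    std::printf("mirror on for %.4f s of %.4f s (duty %.1f%%)\n",
                onTime, frameTime, 100.0 * onTime / frameTime);
    return 0;
}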
Figure 4.8. (a) The Texas Instruments DMD chip. (b) Close up of the micromirrors.
(c) 3D impression of the micromirror structure. (d) Micrograph of the mechanism
without mirror. (e) Principle of operation. When a pixel is on, the mirror reflects
the light into the lens. When a pixel is off, the light is reflected onto a black body
absorber. (Images courtesy of Texas Instruments.)
Figure 4.9. DLP projectors are built with either one or three DMD chips. In the single-chip projector, a rotating wheel passes red, green and blue filters through the light path. In the three-chip projector, a dichroic prism splits the light into primary colors, which are processed separately and reflected into the lens. (Images courtesy of Texas Instruments.)
Figure 4.10. (a) Projecting onto a curved surface distorts the displayed pictures.
A graphics adapter with dual video outputs can drive two projectors, one generating the left side of the display and the other the right side (Figure 4.10(b)). It
is also possible to configure networked computers so that each node on
the network renders one piece of projection. More programming effort
is required to make sure the renderer synchronizes the views it is generating. Synchronizing the video signals sent to the projectors is vital
for stereoscopic imaging, and quite a few of the stereo-ready graphics
adapters offer an interface to facilitate this. In fact, the primary use of
these adapters is in synchronizing rendering for ultra high resolution
displays.
Distortion. Unless the direction of projection is perpendicular to the
screen, the display will be distorted. You are probably familiar with
the idea of correcting for keystone distortion. However, if we are using
two projectors as shown in Figure 4.10(c) then we will have to arrange
that our software corrects for the fact that the projection is not at right
angles to the screen.²
²Projecting two images in the way depicted in Figure 4.10(c) is often used to minimize the shadow-casting effect that arises in front projection. Because the beams are directed from the side, a person standing in the middle will not cast a shadow on the screen. This is especially true if we allow for some overlap.
The problem becomes even more complex if the screen is curved [21].
In a semi-cylindrical cave, a projection will exhibit distortion, as shown
in Figure 4.11(a). To overcome this, the rendering engine is programmed to negate the effect of the distortion (Figure 4.11(b)). We
show how to do this in practice in Section 18.5. We can use the same
idea to correct for all sorts of distortion. In a multi-projector cave,
aligning adjacent images is made much easier by interactively adjusting
the software anti-distortion mechanism rather than trying to mechanically point projectors in exactly the right direction (it also helps when
building a portable cave).
Blending. In theory, projecting an image as a set of small tiles (one tile per projector) should appear seamless, provided the distortion has been corrected. Unfortunately, in practice there is always an annoying little gap. Often it arises because of minor imperfections in the display screen; a cylindrical shape approximated by small planar sections, for example. This annoying artefact can be removed by arranging for the tiles to have small overlapping borders. Inside the border area, brightness is gradually reduced in one tile while it is increased in the other (a short sketch of such a blend function is given below). Having to build tiles with overlapping edges complicates the projection software drivers, but as Figure 4.12 demonstrates, large panoramic images can be obtained using tiled sections that overlap and blend to make the join appear seamless.
Figure 4.11. (a) A panoramic projection from three sources gives rise to nonlinear distortion on a curved screen. To correct for the distortion, the projected images are pre-processed. The image in (b) shows the changes made to the projections to correct for the distortion in (a). After pre-processing, the nonlinear distortion is greatly reduced (c). (The rectangular grid in the images is used to manually configure the pre-processor.)
Figure 4.12. When projecting onto a non-flat surface, the edges rarely line up. Small imperfections are very noticeable to the human eye, so a better approach is to overlap the adjacent projections and gradually blend between them. In (a) we see the result of including a small overlap. By applying a gradual blend (using a sigmoid function (c)) from one projection to the other, the area of overlap becomes indistinguishable from the rest of the projection (b).
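A minimal sketch of the blend weighting just described: t runs from 0 to 1 across the overlap band, the left-hand tile is scaled by w(t) and the right-hand tile by 1 − w(t), so the two contributions always sum to one. The steepness value is an arbitrary choice of ours and in practice would be tuned by eye.

#include <cmath>

// Sigmoid blend weight across the overlap band between two projected tiles.
// t = 0 at the edge nearest the left tile's interior, t = 1 at the far edge.
double blendWeight(double t, double steepness = 10.0) {
    return 1.0 / (1.0 + std::exp(steepness * (t - 0.5)));  // ramps smoothly from ~1 down to ~0
}
// A pixel in the overlap is formed as: left * blendWeight(t) + right * (1.0 - blendWeight(t)).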
Figure 4.13 illustrates a 180° cylindrical stereoscopic panoramic display built using low-cost components. It is driven by a pair of PCs, each sending video output to two projectors. The graphics processors are fast enough to run the distortion-correction software without the need for any special hardware.
Figure 4.13. The projectors are driven by two networked computers, each with dual video outputs. The system uses custom display software and can be interactively configured to correct for distortion.
4.2
Once the basic decision has been made about the type of display technology
that will be utilized for the virtual environment, you also have to decide how
to include interaction with the other senses:
The sound system is relatively easy to provide and control. A number of software libraries, for example Microsoft's DirectSound [13], allow application programs to easily render 3D soundscapes which match the graphics (a minimal sketch is given after this list).
We will see in Chapter 10 that the sense of stereopsis adds greatly to
the perception of reality. The commonly used method of providing this
effect is by using infrared (IR) controlled shutter eyewear. This technology requires the IR signal to be received by the eyewear at all times,
and so the cave designer must install a sufficient number of emitters to
allow all the users of the cave to pick up the signal.
Head-mounted display VR systems must employ motion tracking.
Without it, the correct viewpoint and orientation cannot be generated. In a cave, it might at first sight not seem so important to know
where the observers are. Nevertheless, the global component (the cave)
and the personal component (the head-mounted display) are likely to
merge in the future and a good cave designer should recognize this and
include some position-sensing technology. One way to do this is by
including an ultrasonic emitter into the roof and walls of the cave, as
described in Section 4.3.
In the introduction to this chapter, we discussed the importance of user
interaction through input or combined input/output devices. Within
the desktop, you have the luxury of working within a relatively small
working volume. In a cave, this is not the case, and as such, natural
interaction is much more difficult to achieve.
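To give a flavour of the sound point in the list above, here is a heavily abbreviated DirectSound sketch that creates a sound buffer with 3D control and moves its apparent source around the scene. The window handle, the wave format and the loading of the actual PCM data are all assumed to exist elsewhere, and every scrap of error handling has been stripped out; it is a sketch of the idea, not production code.

#include <windows.h>
#include <dsound.h>
#pragma comment(lib, "dsound.lib")
#pragma comment(lib, "dxguid.lib")

LPDIRECTSOUND8        g_ds    = NULL;
LPDIRECTSOUNDBUFFER   g_buf   = NULL;
LPDIRECTSOUND3DBUFFER g_buf3d = NULL;

bool InitSound(HWND hWnd, WAVEFORMATEX* wfx, DWORD bufferBytes) {
    if (FAILED(DirectSoundCreate8(NULL, &g_ds, NULL))) return false;
    g_ds->SetCooperativeLevel(hWnd, DSSCL_PRIORITY);

    DSBUFFERDESC desc = {0};
    desc.dwSize        = sizeof(desc);
    desc.dwFlags       = DSBCAPS_CTRL3D | DSBCAPS_CTRLVOLUME;  // ask for 3D control
    desc.dwBufferBytes = bufferBytes;   // sized for the (mono) PCM data loaded elsewhere
    desc.lpwfxFormat   = wfx;
    if (FAILED(g_ds->CreateSoundBuffer(&desc, &g_buf, NULL))) return false;

    // The 3D interface lets us position the source within the virtual scene.
    g_buf->QueryInterface(IID_IDirectSound3DBuffer, (void**)&g_buf3d);
    return true;
}

void MoveSource(float x, float y, float z) {
    // Keep the sound position in step with the corresponding virtual object.
    if (g_buf3d) g_buf3d->SetPosition(x, y, z, DS3D_IMMEDIATE);
}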
We feel that touch is the next most important sense (after vision) for a user to engage with whilst within the virtual environment. As such, the remaining
sections will describe the current state of the art in input devices which allow
us to manipulate objects within the virtual environment and output devices
that allow us to sense the reaction to any changes we make. That is, they
stimulate our sense of touch. These are commonly referred to as haptic devices,
and before we describe both the input and haptic output devices, we thought
we might give you a crash course in the theory of haptics.
4.2.1 Haptics
In Section 2.3, we discovered that the human haptic system has an important role to play in human interaction with VR. But we have also seen that
whilst haptic technology promises much, it comes with many unresolved
problems and complexities. State-of-the-art haptic interfaces are still rather crude.
However, it is a rapidly developing topic, with new ideas emerging regularly from research labs. Srinivasan and Basdogan [18] and Salisbury
et al. [17] provide interesting general details of how haptic systems are classified and the challenges of trying to build haptic hardware interfaces. In
designing a VR system, and especially a cave-type system, we would ideally
like our haptic device to be able to work at long range, not get in the way,
and offer the ability to appear to pick up a virtual object and get a sense of its
weight. By extrapolation, if we could do that, most of what we would want
to do with haptics would be possible.
Following [18], a useful way to classify a haptic device is by the way it is attached to a fixed location.
Floating devices are things such as gloves which can perform inter-digit
tasks. They can also measure finger contacts and finger-specific resistance. They cannot however measure or reproduce absolute weight or
the inertial effects of a virtual object. They may only be attached to
their base location via wireless link.
Exoskeleton devices, typically worn on a hand, arm or leg, may have
motorized devices that can resist certain motions and restrict the number of degrees of freedom. For example, such a device may prevent you
from closing your hand too tightly around a virtual ball as you attempt
to pick it up. This type of haptic device, like the first, does not allow
you to experience a weight effect.
Grounded devices behave rather like the arm of a small robot. They can
be powerful and simulate multi-axial forces, including everything from
a solid wall to picking up a virtual object and assessing its weight. They
have the disadvantage of being bulky and very restricted in the volume
in which they can move.
In the real world, we can make quite a range of movements, termed degrees of freedom (DOF). In three dimensions, six DOF provides a complete
specification of position and orientation. As such, a haptic device need only
specify a force along three mutually perpendicular axes, but that force could
depend on position and angle at the point of contact. Many haptic devices,
such as the force-feedback joysticks used in computer games, can only simulate a force along two axes and at a single point of contact (since the joystick
does not move).
An ideal haptic device has to have some formidable properties: low inertia and minimal friction, with device kinematics organized so that free motion can be simulated, together with the ability to simulate inertia, friction and stiffness.
These devices can achieve tactile stimulation in a number of different ways:
Mechanical springs activated by solenoid, piezoelectric crystal and shape memory alloy technologies.
Vibrations from voice coils.
Pressure from pneumatic systems.
Heat-pump systems.
However, these are all still technologies under active research, and most haptic feedback is offered via joysticks or data gloves. In addition, a haptic device is
obviously useless if it does not have haptic-enabled VR software controlling
it. When developing this type of software, it is helpful to imagine haptics
operating in an analogous way to the 3D graphics with which we are familiar.
The term haptic rendering says it all. One specifies a haptic surface; it could
consist of the same shapes and even be specified in the same way with planar
polygons that the 3D renderer uses to visualize the shapes for us. Continuing
with the analogy, we can texture the surface by describing its rigidity, friction
etc. just as we do in terms of color, reflectivity etc. for the visuals.
Figure 4.14 is a block outline of the structure of the haptic-rendering
algorithm. The gray box shows the main elements in the haptic-rendering
loop. The device sends information about the position of its point of contact.
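To give a feel for what the haptic-rendering loop has to compute, the sketch below renders the simplest possible haptic surface: a single plane pushed back against the user with a spring-damper (penalty) force. The plane, the stiffness and the damping values are illustrative choices of ours and not part of any particular device's interface; a real renderer would run this at around 1 kHz and over many polygons.

// Penalty-force rendering of one planar haptic surface. Called once per servo
// tick with the device's contact-point position p and velocity v; returns the
// force to command on the device.
struct Vec3 { double x, y, z; };

static double dot(const Vec3& a, const Vec3& b) {
    return a.x*b.x + a.y*b.y + a.z*b.z;
}

Vec3 planeForce(const Vec3& p, const Vec3& v,
                const Vec3& n,        // unit normal of the haptic plane
                double offset,        // the plane is the set of points x with dot(n, x) = offset
                double k = 800.0,     // surface stiffness in N/m (illustrative)
                double b = 2.0)       // damping in N*s/m (illustrative)
{
    double d = dot(n, p) - offset;              // signed distance to the plane
    if (d >= 0.0) return {0.0, 0.0, 0.0};       // above the surface: no contact, no force
    double f = -k * d - b * dot(n, v);          // spring-damper force along the normal
    if (f < 0.0) f = 0.0;                       // never pull the device into the surface
    return { f * n.x, f * n.y, f * n.z };
}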
4.2.2 Desktop Interaction
In a desktop VR environment, the user may only require some basic interaction with the VR world, such as 3D navigation. Devices such as a trackball
or a 3D joystick can offer this facility. Other forms of input such as light
pens and drawing tablets, may be only marginally relevant for VR work.
Figure 4.15 illustrates examples of some of these devices.
Figure 4.15. A wide range of desktop devices extend the range of input possibilities. A spaceball (a) offers the possibility of movement in three dimensions. The joystick is a low-cost device with two or three degrees of freedom. Some degree of haptic feedback can be achieved with basic joysticks (b). To give a comprehensive haptic response requires a device such as SensAble's Phantom (c).
However, if the user of the desktop system also needs to experience feedback, then a haptic output device is required. Most of the latest technological advances in these devices have come as a direct result of the huge commercial impact of computer games. What one would class as cheap force-feedback devices (joysticks, mice, steering consoles, flight-simulator yokes) are in high-volume production and readily available. If building your own VR system, take advantage of these devices, even to the extent of tailoring your software development to match them. Microsoft's DirectInput application programming interface (API) library was specifically designed with game programmers in mind, so it contains appropriate interfaces to the software drivers of most input devices. We provide a number of input device programming examples in Chapter 17.
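By way of a small taster for Chapter 17, the fragment below shows the bare bones of polling a keyboard through DirectInput 8. The window handle and what is done when a key is found to be down are assumptions of ours, and all error handling has been stripped out.

#define DIRECTINPUT_VERSION 0x0800
#include <windows.h>
#include <dinput.h>
#pragma comment(lib, "dinput8.lib")
#pragma comment(lib, "dxguid.lib")

LPDIRECTINPUT8       g_di  = NULL;
LPDIRECTINPUTDEVICE8 g_kbd = NULL;

bool InitInput(HINSTANCE hInst, HWND hWnd) {
    if (FAILED(DirectInput8Create(hInst, DIRECTINPUT_VERSION,
                                  IID_IDirectInput8, (void**)&g_di, NULL)))
        return false;
    g_di->CreateDevice(GUID_SysKeyboard, &g_kbd, NULL);
    g_kbd->SetDataFormat(&c_dfDIKeyboard);                       // standard 256-key layout
    g_kbd->SetCooperativeLevel(hWnd, DISCL_FOREGROUND | DISCL_NONEXCLUSIVE);
    g_kbd->Acquire();
    return true;
}

void PollInput() {       // call once per frame of the VR render loop
    BYTE keys[256];
    if (SUCCEEDED(g_kbd->GetDeviceState(sizeof(keys), (LPVOID)keys))) {
        if (keys[DIK_LEFT] & 0x80) { /* e.g., rotate the viewpoint to the left */ }
    }
}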
There is plenty of scope for the development of custom-built haptic output devices. Flexible pressure sensors, custom motorized articulated linkages
and a host of novel devices might be called on by the imaginative person to
research possible new ways of interaction.
4.2.3 Immersive Interaction
Figure 4.16. Sensing the position of individual fingers relative to one another is a complex task. (a) Immersion Corporation's Cyberglove provides joint-angle data via a wireless link. (Reproduced by permission of Immersion Corporation, Copyright © 2007, Immersion Corporation. All rights reserved.) (b) The Fakespace Pinch Glove has sensors in the fingertips to detect contact between two or more fingers. (Reproduced by permission of Fakespace Systems, Copyright © 2007, Fakespace Systems.)
4.3
The final piece in the jigsaw of immersive VR technology concerns how position and orientation can be acquired from the real world, relative to a real-world frame of reference, and then matched to the virtual world, which has its own coordinate system.
With the present state of hardware development, there is still no all-embracing method of sensing everything we would like to be able to sense within a real-world space. Some currently developed systems work better on the small scale, some work better on a large scale, some work more accurately than others and some sense things others can miss. However, there is usually a trade-off between one desirable feature and another, and so how you choose to acquire your real-world data will depend on the application you are working on. One thing is certain: you can make a much better choice of which technology to use if you know something of how each works, and what advantages and disadvantages it has. We hope to give you some of this knowledge within this section.
Following on with the broad idea of real-world sensing, we can identify three themes: motion capture, eye tracking and motion tracking. Although really they are just the same thing (find the position of a point or points of interest in a given volume), the present state of hardware development cannot provide a single solution that addresses them all at the same time. Indeed, we have
broken them down into these themes because a single piece of hardware tends
to work well for only one of them.
Motion capture is perhaps the most ambitious technology, given that it
attempts to follow the movement of multiple points across the surface of
a whole object. Optical or mechanical systems can be used to gather the
position of various points on a body or model as it is moved within a given
space within a given timescale. Later, these template actions can be utilized
in all sorts of different contexts to construct an animated movie of synthetic
characters who have, hopefully, realistic behavior. This is especially important
when trying to animate body movements in a natural way. Since VR systems are more concerned with real-time interactivity, the emphasis shifts slightly, with the focus directed more towards motion tracking. It is difficult to be precise about where motion capture ends and motion tracking begins. However, some of the differences include:
Continuous operation. Often motion-capture systems need downtime
for recalibration after short periods of operation.
Number of points sensed. Motion tracking may need only one or two
points, whereas motion capture may require tracking 20 points.
Accuracy. A motion-tracking system for a haptic glove may need sub-millimeter accuracy; a full-body motion-capture system may only need accuracy to within a couple of centimeters.
We shall consider motion capture to be the acquisition of moving points
that are to be used for post-processing of an animated sequence, whilst motion
tracking shall be considered the real-time tracking of moving points for real-time analysis. So, restricting the rest of our discussion to systems for motion
tracking, we will look at the alternative technologies to obtain measures of
position and orientation relative to a real-world coordinate system. It will be
important to assess the accuracy, reliability, rapidity of acquisition and delay
in acquisition. Rapidity and delay of acquisition are very important because,
if the virtual elements are mixed with delayed or out-of-position real elements,
the VR system user will be disturbed by it. If it does not actually make users
feel sick, it will certainly increase their level of frustration.
Currently, there are five major methods of motion tracking. However,
their common feature is that they all work either by triangulation or by measuring movement from an initial reference point, or perhaps best of all, by a
combination of the two.
4.3.1 Inertial Tracking
Accelerometers are used to determine position and gyroscopes to give orientation. They are typically arranged in triples along orthogonal axes, as shown in Figure 4.17(a). An accelerometer actually measures a force F; acceleration is determined from a known mass m using Newton's second law, a = F/m. The position is then updated by integrating twice: p_new = p_old + ∫∫ (F/m) dt dt. Mechanically, an accelerometer is a spring-mass scale with the effect of gravity removed. When an acceleration takes place, it drags the mass away from its rest position so that F = kx, where x is the displacement and k is a constant based on the characteristics of the system. In practice, springs are far too big and subject to unwanted influences, such as shocks, and so electronic devices such as piezoelectric crystals are used to produce an electric charge that is proportional to the applied force.
A spinning gyroscope (gyro) has the interesting property that if you try to
turn it by applying a force to one end, it will actually try to turn in a different
direction, as illustrated in Figure 4.17(b). A simple spinning gyro does not fall
under the action of gravity; it feels a force that makes it rotate about (precess)
the vertical axis y. If an attempt is made to turn the gyro about the z-axis
Figure 4.17. An inertial tracker uses three gyros and three accelerometers to sense position and orientation.
(coming out of the page, in the side view) by pushing the gyro down, the result is an increased Coriolis force tending to make the gyro precess faster. This extra force δf can be measured in the same way as the linear acceleration force; it is related to the angular velocity ω = f(δf) of the gyro around the z-axis. Integrating ω once gives the orientation about z relative to its starting direction.
If a gyro is mounted in a force-sensing frame and the frame is rotated in a direction that is not parallel with the axis of the gyro's spin, the force
trying to turn the gyro will be proportional to the angular velocity. Integrating the angular velocity will give a change in orientation, and thus if
we mount three gyros along mutually orthogonal axes, we can determine
the angular orientation of anything carrying the gyros. In practice, traditional gyros, as depicted in Figure 4.17(b), are just too big for motion
tracking.
A clever alternative is to replace the spinning disc with a vibrating device that resembles a musical tuning fork. The vibrating fork is fabricated on
a microminiature scale using microelectromechanical system (MEMS) technology. It works because the in-out vibrations of the ends of the forks will
be affected by the same gyroscopic Coriolis force evident in a rotating gyro
whenever the fork is rotated around its base, as illustrated in Figure 4.18(a).
If the fork is rotated about its axis, the prongs will experience a force pushing them to vibrate perpendicular to the plane of the fork. The amplitude
of this out-of-plane vibration is proportional to the input angular rate, and
it is sensed by capacitive or inductive or piezoelectric means to measure the
angular rate.
The prongs of the tuning fork are driven by an electrostatic, electromagnetic or piezoelectric force to oscillate in the plane of the fork. This generates an additional force on the end of the fork, F = 2mΩ × v, which occurs at right angles to the direction of vibration and is directly related to the angular velocity Ω with which the fork is being turned and the vector v representing the excited oscillation. By measuring this force, the angular velocity can be deduced and integrated to give the orientation.
Together, three accelerometers and three gyros give the six measurements
we need from the real world in order to map it to the virtual one. That is,
(x, y, z) and roll, pitch and yaw. A mathematical formulation of the theory of
using gyros to measure orientation is given by Foxlin [8].
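A toy illustration of why the integration causes the drift discussed in the next paragraph: the gyro rate is integrated once for orientation and the accelerometer output twice for position, so any constant sensor bias grows linearly in the angle and quadratically in the position. The single-axis state and the assumption that gravity has already been subtracted are simplifications of ours.

// One axis of a naive strap-down integrator. omega is the gyro rate (rad/s), accel
// the gravity-compensated accelerometer reading (m/s^2), dt the sample period (s).
struct AxisState { double angle, velocity, position; };

void integrateStep(AxisState& s, double omega, double accel, double dt) {
    s.angle    += omega * dt;        // single integration for orientation
    s.velocity += accel * dt;        // first integration of F/m
    s.position += s.velocity * dt;   // second integration gives position
}
// A constant bias b in accel adds roughly 0.5*b*t*t to the position after t seconds,
// which is why inertial trackers must be recalibrated against another sensing method.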
The requirement to integrate the signal in order to obtain the position
and orientation measures is the main source of error in inertial tracking. It
tends to cause drift unless the sensors are calibrated periodically. It works very
well over short periods of time and can work at high frequencies (that is, it
can report position often and so can cope well in fast changing environments).
Inertial tracking is best used in conjunction with one of the other methods,
because of the problem of initialization and the increases in error over time
without recalibration. Other problems with this method are electrical noise
at low frequencies, which can give the illusion of movement where there is
none, and a misalignment of the gravity correction vector. Inertial tracking
has the potential to work over very large volumes, inside buildings or in open
spaces.
Figure 4.18. (a) A tuning-fork gyro. (b) The fork is fabricated using MEMS technology in the sensing system used in the BEI Systron Donner Inertial Division's GyroChip technology.
4.3.2 Magnetic Tracking
Magnetic tracking gives the absolute position of one or more sensors. It can be used to measure range and orientation. There are two types of magnetic-tracking technology: they use either low-frequency AC fields or pulsed DC
fields which get around some of the difficulties that AC fields have when the
environment contains conducting materials. Magnetic trackers consist of a transmitter and a receiver, each made up of three coils arranged orthogonally. The transmitter excites each of its three coils in sequence, and the
induced currents in each of the receiving coils are measured continuously, so
any one measurement for position and orientation consists of nine values.
Figure 4.19. (a) Magnetic induction in one sensing coil depends on the distance d (or d′) from the field-generating coil and the angle θ it makes with the field. Three coils can sense fields in mutually perpendicular directions. (b) Three emitter coils and three sensing coils allow the six degrees of freedom (position (x, y, z) and orientation: pitch, roll and yaw) to be determined.
The signal strength at each receiver coil falls off with the cube of the distance from the transmitter and varies with the cosine of the angle between the coil's axis and the direction of the local magnetic field (see Figure 4.19).
The strength of the induced signal as measured in the three receiving
coils can be compared to the known strength of the transmitted signal to
calculate distance. By comparing the three induced signal strengths amongst
themselves, the orientation of the receiver may be determined.
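Sticking with the cube-law falloff just described, a crude range estimate can be made by combining the nine coil readings into a single magnitude (so that the angular terms largely cancel) and inverting the cube law; the calibration constant k would be measured once at a known distance. This is only a sketch of the principle, not the algorithm used by any particular commercial tracker, which must also solve for orientation.

#include <cmath>

// S is the combined magnitude of the nine induced-signal measurements, k a
// calibration constant recorded at a known range. Since S falls off as 1/d^3,
// the distance follows as d = (k / S)^(1/3).
double estimateRange(double S, double k) {
    return std::cbrt(k / S);
}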
Magnetic tracking sensors can feed their signals back to the data
acquisition hardware and computer interface via either a wired or a wireless link. A typical configuration will use one transmitter and up to
10 sensors.
The main disadvantages of magnetic motion tracking are the problems
caused by the high-strength magnetic field. Distortions can occur due to a
lot of ferromagnetic material in the environment, which will alter the field
and consequently result in inaccurate distances and orientations being determined. The device also tends to give less accurate readings the further
the sensors are away from the transmitter, and for sub-millimeter accuracy, a
range of 2–3 m may be as big as can be practically used.
4.3.3 Acoustic Tracking
Figure 4.20. (a) Acoustic tracking takes place by triangulation. Two emitters are
shown and give rise to a set of possible locations for the sensor lying on the circle of
intersection between the spheres, whose radii depend on the time of flight and the speed of sound in air. A third emitter will differentiate between these possibilities. (b)
A sophisticated system has three microphone sensors mounted together, and they
receive signals from an array of ultrasonic emitters or transponders.
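The caption above describes position finding by intersecting spheres whose radii are the speed of sound multiplied by the measured times of flight. With four emitters that do not lie in a plane, the intersection reduces to a small linear system, as the sketch below shows; the four-emitter layout and the use of plain Cramer's rule are simplifications of ours, and practical systems (such as the hybrid tracker in Section 4.3.7) add redundancy and filtering on top of this.

#include <cmath>

struct Vec3 { double x, y, z; };

// Solve for the receiver position p given four emitter positions e[0..3] and the
// corresponding distances r[0..3] (range = speed of sound x time of flight).
// Subtracting the sphere equation for e[0] from the other three gives a 3x3
// linear system A p = b, solved here with Cramer's rule.
bool trilaterate(const Vec3 e[4], const double r[4], Vec3& p) {
    double A[3][3], b[3];
    double n0 = e[0].x*e[0].x + e[0].y*e[0].y + e[0].z*e[0].z;
    for (int i = 1; i <= 3; ++i) {
        A[i-1][0] = 2.0 * (e[i].x - e[0].x);
        A[i-1][1] = 2.0 * (e[i].y - e[0].y);
        A[i-1][2] = 2.0 * (e[i].z - e[0].z);
        double ni = e[i].x*e[i].x + e[i].y*e[i].y + e[i].z*e[i].z;
        b[i-1] = (ni - n0) - (r[i]*r[i] - r[0]*r[0]);
    }
    double det =  A[0][0]*(A[1][1]*A[2][2] - A[1][2]*A[2][1])
                - A[0][1]*(A[1][0]*A[2][2] - A[1][2]*A[2][0])
                + A[0][2]*(A[1][0]*A[2][1] - A[1][1]*A[2][0]);
    if (std::fabs(det) < 1e-9) return false;   // emitters coplanar: no unique solution
    p.x = ( b[0]*(A[1][1]*A[2][2] - A[1][2]*A[2][1])
          - A[0][1]*(b[1]*A[2][2] - A[1][2]*b[2])
          + A[0][2]*(b[1]*A[2][1] - A[1][1]*b[2]) ) / det;
    p.y = ( A[0][0]*(b[1]*A[2][2] - A[1][2]*b[2])
          - b[0]*(A[1][0]*A[2][2] - A[1][2]*A[2][0])
          + A[0][2]*(A[1][0]*b[2] - b[1]*A[2][0]) ) / det;
    p.z = ( A[0][0]*(A[1][1]*b[2] - b[1]*A[2][1])
          - A[0][1]*(A[1][0]*b[2] - b[1]*A[2][0])
          + b[0]*(A[1][0]*A[2][1] - A[1][1]*A[2][0]) ) / det;
    return true;
}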
4.3.4 Optical Tracking
Figure 4.21. By placing markers (the letters A, C and D) on the objects to be tracked,
it is possible to determine their locations within the scene and track their movement.
Using multiple camera views allows the tracking to continue so long as each marker is
visible in at least one camera's field of view. (Scene courtesy of Dr. B. M. Armstrong
and Mr. D. Moore.)
4.3.5 Mechanical Tracking
4.3.6 Location Tracking
In Section 4.3.3, it was pointed out that the theory of acoustic tracking relied on the time of flight of a sound pulse between an emitter and sensor.
The same concept applies to electromagnetic (EM) waves, but because these
travel at the speed of light, it is necessary to determine the time of flight much
more accurately. However, using EM waves in the microwave frequency band
allows tracking devices to operate over larger distances. Because these systems are a little less accurate than those based on hybrid inertial/acoustic
methods, we prefer to think of them as location trackers.
Perhaps one of the best-known radio-based tracking and navigation systems is the global positioning system, ubiquitously known as GPS. The GPS
(and Galileo, the European equivalent) is based on a system of Earth-orbiting
satellites [11] and may offer a useful alternative to the more limited-range tracking systems. It has the advantage of being relatively cheap and easy to interface to handheld computers, but until the accuracy is reliably in the sub-millimeter range, its use remains hypothetical.
Back on the ground, Ubisense's Smart Space [19] location system uses short-pulse radio technology to locate people to an accuracy of 15 cm in three dimensions and in real time. The system does not suffer from the drawbacks of conventional radio-frequency trackers, which suffer from multipath reflections that might lead to errors of several meters. In Smart Space, the objects or people being tracked carry a UbiTag. This emits pulses and communicates with UbiSensors that detect the pulses and are placed around and within the typical coverage area (usually 400 m²) being sensed. By using two different algorithms, one to measure the difference in time of arrival of the pulses at the sensors and the other to detect the angle of arrival of the pulses at the sensor, it is possible to detect positions with only two sensors.
The advantage of this system is that the short pulse duration makes it
easier to determine which are the direct signals and which arrive as a result
of echoes. The fact that the signals pass readily through walls reduces the
infrastructure overhead.
Another ground-based location tracking system that works both inside
buildings and in the open air is ABATEC's Local Position Measurement (LPM) system [1]. This uses microwave radio frequencies (5–6 GHz) emitted from a group of base stations that can determine the location of up to
16,000 small transponders at a rate of more than 1000 times per second
and with an accuracy of 5 cm. The base stations are connected via optical
fiber links to a hub that interfaces to a standard Linux PC.
These location-tracking systems might not be the best option for tracking
delicate movement or orientation, e.g., for measuring coordination between
hand and eye or walking round a small room or laboratory. However, there
are many VR applications such as depicting the deployment of employees in
an industrial plant, health-care environment or military training exercises to
which this technology is ideally suited.
4.3.7 Hybrid Tracking
Hybrid tracking systems offer one of the best options for an easy-to-use, simple-to-configure, accurate and reliable motion-tracking system. There are many
possible combinations, but one that works well involves combining inertial
and acoustic tracking. The inertial tracking component provides very rapid
motion sensing, whilst acoustic tracking provides an accurate mechanism
for initializing and calibrating the inertial system. If some of the acoustic
pulses are missed, the system can still continue sending position information
from the inertial sensor. In the system proposed by Foxlin et al. [7], a head-mounted tracking system (HMTS) carries an inertial sensor calibrated from an acoustic system. The acoustic system consists of a network of transponders placed around the boundaries of the tracking volume (typically a large room).
The HMTS has a light emitter which sends a coded signal to trigger ultrasonic
pulses from the transponders one at a time. The time of flight for the sound
pulse from transponder to acoustic receptor (microphone) gives its distance.
With three or four received ultrasonic pulses, the position of the sensor can be
determined. Having eight or more transponders allows for missed activations,
due to the light emitter not being visible to the transponder or the response
not being heard by the sound sensor. The system has many other refinements,
such as using multiple groups of transponders and applying sophisticated signal processing (Kalman filtering) algorithms to the ultrasound responses to
eliminate noise and echoes. Figure 4.20(b) illustrates the general idea. The
advantages of this system are its robustness, due to transponder redundancy, and its elimination of noise and error, due to the transponders being activated by the sensor itself and to the electronic signal processing of the ultrasonic response.
4.3.8
Commercial Systems
(a) Flock of Birds (magnetic tracking). (b) InterSense's IS-900 acoustic tracking system. (Photographs courtesy of Dr. Cathy Craig, Queen's University Belfast.)
4.4
Summary
In this chapter, we examined the current state of the technology used to build
VR systems. It seems that VR on the desktop, for applications such as engineering design and the creative arts, is flourishing and providing a valuable
and substantially complete service. On the other hand, any system that attempts to free the virtual visitor from her seat or desktop still has a long way
to go, especially in the area of interacting with the sense of touch.
In any practical VR system, the ability to sense where things are and
what they are doing in the real world is a significant problem. It requires
Bibliography
[1] ABATEC Electronic AG. Local Position Measurement. http://www.lpm-world.
com/.
[2] Ascension Technology. Flock of Birds. http://www.ascension-tech.com/
products/flockofbirds.php.
[3] Ascension Technology. laserBIRD. http://www.ascension-tech.com/products/
laserbird.php.
[4] Ascension Technology. MotionStar. http://www.ascension-tech.com/products/
motionstar.php.
[5] ARToolKit. http://www.hitl.washington.edu/artoolkit/.
[6] D. A. Bowman, E. Kruijff, J. J. LaViola and I. Poupyrev. 3D User Interfaces: Theory and Practice. Boston, MA: Addison Wesley, 2005.
[7] E. Foxlin et al. Constellation: A Wide-Range Wireless Motion-Tracking System for Augmented Reality and Virtual Set Applications. In Proceedings of SIGGRAPH 98, Computer Graphics Proceedings, Annual Conference Series, edited by M. Cohen, pp. 371–378. Reading, MA: Addison Wesley, 1998.
[8] E. Foxlin et al. Miniature 6-DOF Inertial System for Tracking HMDs. In
Proceedings of SPIE Vol. 3362, Helmet and Head-Mounted Displays, AeroSense
98. Bellingham, WA: SPIE, 1998.
[9] L. J. Hornbeck. From Cathode Rays to Digital Micromirrors: A History of
Electronic Projection Display Technology. TI Technical Journal 15:3 (1998)
7–46.
[10] InterSense Corporation. IS-900 System. http://www.intersense.com/products.
aspx?id=45&.
[11] E. Kaplan and C. Hegarty (editors). Understanding GPS: Principles and Applications, Second Edition. Norwood, MA: Artech House, 2005.
[12] Meta Motion. Gypsy 4/5 Motion Capture System. http://www.metamotion.com/
gypsy/gypsy-motion-capture-system.htm.
[13] Microsoft Corporation. DirectSound. http://www.microsoft.com/windows/
directx/.
[14] R. Raskar. Immersive Planar Display Using Roughly Aligned Projectors. In
Proceedings of the IEEE Virtual Reality Conference, p. 109. Washington, D.C.:
IEEE Computer Society, 2000.
[15] R. Raskar. Projectors: Advanced Geometric Issues in Applications. SIGGRAPH Course. ACM SIGGRAPH, 2003.
[16] R. Raskar et al. Multi-Projector Displays Using Camera-Based Registration. In Proceedings of the IEEE Conference on Visualization, pp. 161–168. Los Alamitos, CA: IEEE Computer Society Press, 1999.
[17] K. Salisbury et al. Haptic Rendering: Introductory Concepts. IEEE Computer Graphics and Applications 24:2 (2004) 24–32.
[18] M. A. Srinivasan and C. Basdogan. Haptics in Virtual Environments: Taxonomy, Research Status, and Challenges. Computers and Graphics 21:4 (1997) 393–404.
[19] Ubisense. Smart Space. http://www.ubisense.net.
[20] UGS Jack. http://www.ugs.com/products/tecnomatix/human performance/
jack/.
Describing and Storing the VR World
There are many aspects that need to be considered when thinking about storing a practical and complete representation of a virtual world that has to include pictures, sounds, video and 3D structure. For example: how much detail should we include? How much storage space is it going to occupy? How easy is it going to be to read? Unlike the specific case of designing a database for computer animation programs or engineering-type computer-aided design (CAD) applications, VR involves a significant mix of data types: 3D structure, video, sound and even specifications of touch and feel.
In this chapter, we will explore how some of the most useful storage strategies work. That is, how they can be created by applications, how they can be
used in application programs and how they work together to enhance the virtual experience (for example, a video loses a lot of its impact if the soundtrack
is missing).
Despite the use of the word reality in the title of this book, VR is not just
about the real world as we perceive it. Often the use of VR is to present the
world as we have never perceived it before. Visualizing and interacting with
scientific data on a large projection display can and does result in surprising
discoveries. The same VR data formats that we so carefully design for recording and describing a real environment can also be used to describe the most
unreal synthetic environments, scenes and objects.
Since the world we live in and move around in is three dimensional, it
should be no surprise to hear that the most obvious aspects of a virtual world
that we have to describe and record will be a numerical description of the 3D
structure that one sees, touches and moves about in. Thus we will begin with some suggestions for describing different shapes and geometry using numbers.
5.1
In essence, there is only one way to describe the structure of a virtual world
which may be stored and manipulated by computer. That way is to represent
the surfaces of everything as a set of primitive shapes tied to locations in a
3D coordinate system by points along their edge. These 3D points are the
vertices which give shape and body to a virtual object. The visual appearance
of the structure will then depend on the properties of the primitive shapes,
with color being the most important. A polygon with three sides (a triangle)
is the simplest form of primitive shape. This primitive shape is also known
as a facet. Other primitive shapes such as spheres, cubes and cylinders are
potential shape builders. They too can be combined to build a numerical
description of an environment or of objects decorating that environment or
scene. CAD application programs tend to use more complex surface shapes,
referred to as patches. These include Bezier patches and NURBS (nonuniform
rational B-spline) patches. These are forms which have curved edges and
continuously varying internal curvature. Real-world shapes can usually be
accurately represented by combining a lot fewer of these more sophisticated
surface patches than are necessary if using triangular polygons.
Whether one uses primitive polygons or curved patches to model the
shape and form of a virtual world, one has to face up to the advantages and
disadvantages of both. Triangular polygons are fast to render and easy to manipulate (they are also handled by the real-time rendering engines now implemented in the hardware of the graphics processing unit (GPU), discussed
in Chapter 14). The more complex curved patches usually give a better approximation to the original shape, especially if it has many curved parts. Occasionally, a hybrid idea is useful. For example, a cube is modeled just as
effectively with triangular polygons as it is with Bezier patches, but something like the famous Utah teapot, shown in Figure 5.1 with polygonal and
Bezier patch representations, is much better represented by 12 patches than
960 triangular polygons.
Whatever we use to describe the surface of a 3D model, it must be located in space, which means attaching it to three or more vertices or points
1. In the first organization, the vertices are stored in one list and the polygons in a second, separate list. Each polygon entry records which vertices define it and may also hold a color value or an index into a list of surface material properties, which will record details such as color, shininess, reflectivity or bumpiness.
2. In this organization, the polygons are obtained from a single list of vertices by making the implied assumption that every m consecutive vertices define a polygon; in the case of triangular polygons, m = 3 and every three consecutive vertices in the list define a polygon. Each vertex may also hold a color value or index into a list of material properties in the same way as with organization 1.
These data-organization schemes are summarized in Figure 5.2. The first
organization keeps down the number of vertices that have to be stored, but
at the expense of having a second independent list of polygons. However, in
cases when the mesh models of objects need to be created with different levels
of detail,1 the first scheme may be better because one can augment the entry
for each polygon with information about its neighboring polygons and more
readily subdivide the mesh to increase the detail. The second organization is
the one used by hardware rendering libraries such as OpenGL or Direct3D,
and some variations have been devised to account for particular mesh topologies in which the number of vertices that need to be specified can be reduced.
The triangle strip and triangle fan are examples.
It is easy to obtain the data in organization 2 from the data in organization 1, but extra computation is needed to do the reverse.
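To make these two organizations concrete, the following C++ sketch shows one possible in-memory layout; the type and field names (Vertex, Triangle, IndexedMesh, FlatMesh) are illustrative choices, not structures defined in this chapter.

#include <vector>

// Organization 1: shared vertices plus an independent list of triangular
// polygons that index into the vertex list.
struct Vertex   { float x, y, z; };
struct Triangle { int v0, v1, v2; int materialID; };   // indices into the vertex list

struct IndexedMesh {                 // organization 1
    std::vector<Vertex>   vertices;
    std::vector<Triangle> polygons;
};

// Organization 2: a flat list in which every three consecutive vertices
// implicitly define one triangle (the form expected by OpenGL/Direct3D).
struct FlatMesh {
    std::vector<Vertex> vertices;    // size is 3 * (number of triangles)
};

// Expanding organization 1 into organization 2 only requires dereferencing
// each polygon's vertex indices; the reverse direction needs extra work to
// detect and merge duplicated vertices.
FlatMesh expand(const IndexedMesh &mesh) {
    FlatMesh flat;
    for (const Triangle &t : mesh.polygons) {
        flat.vertices.push_back(mesh.vertices[t.v0]);
        flat.vertices.push_back(mesh.vertices[t.v1]);
        flat.vertices.push_back(mesh.vertices[t.v2]);
    }
    return flat;
}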
5.1.1
The 3D data storage schemes outlined in Figure 5.2 also include methods of
assigning surface attributes to the polygons (organization 1) or the vertices
(and implied polygons) in organization 2. In real-time rendering systems
(which are obviously very important for VR) where we desire to keep the vertex count as low as possible, using material properties that can include color,
shininess and texture mapping2 makes a significant contribution to increasing the realism of the plain geometric data. Textures are usually regarded as a
surface property of the virtual objects and scenery, and they most often follow
the texture coordinates assigned to each vertex. Because textures often involve
¹By storing several models of the same shape, each made to a higher degree of accuracy by using more polygonal facets, an application can speed up its rendering by using the ones with fewer polygons when the shape is further away from the camera. It can also interpolate between models made with different levels of detail as the viewpoint approaches the object so that the switch between the two models is imperceptible.
²Texture mapping is also known as image mapping, decalling, billboarding or just plain texturing.
Figure 5.2. Data organization for the description of a 3D piecewise planar model.
The triangle strip and triangle fan topology allow for fewer vertices to be stored in
organization 1. For example, the triangle strip with 4 polygons needs only 7 vertices
instead of 15, and the fan needs 8 instead of 21.
very large amounts of data, they are not stored on a per-vertex basis but in a separate list where several layers of image data and other items such as light maps, transparency maps and illumination properties can all be kept together. A virtual world may need 1,000,000 polygons to represent its geometry, but 10–50 materials may be enough to give the feeling of reality.
5.1.2
Partitioning a polygonal model into sub-units that are linked in a parent-child-grandchild relationship is useful. For example, hierarchical linkages are
very significant in character animation, robotics and chain linkages. This
merges in well with simulations that involve kinematic and inverse
kinematic (IK) systems where force feedback plays an important part. An
easily implemented and practically useful scheme is one that is analogous to
the familiar file-store structure of computer systems; i.e., a root directory with
files and subdirectories which themselves contain files and subdirectories and
so on, to whatever depth you like. A doubly linked list of hierarchy entries is
the best way to organize this data. In addition to previous and next pointers,
each entry will have a pointer to its parent (see Figure 5.3). Other data can
easily be appended to the structure as necessary. When using the structure for
character animation or IK robotic models, the entries can be given symbolic
names. An example of this is shown in Figure 5.3.
To complete a hierarchical data structure, every polygon or vertex is identified as belonging to one of the hierarchical names. This is a similar concept
to how files on a computer disk are identified by the folders in which they
are stored. In a connected network of vertices and facets, it is probably more
useful to associate vertices, rather than polygons, with one of the hierarchical
names.
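A minimal C++ sketch of such a doubly linked hierarchy is given below; the struct and field names are illustrative assumptions rather than part of any particular package.

#include <string>

// One entry in the hierarchy, held in a doubly linked list with an extra
// pointer to its parent entry.
struct HierarchyEntry {
    std::string     name;       // symbolic name, e.g., "Left Leg"
    HierarchyEntry *previous;   // doubly linked list of entries
    HierarchyEntry *next;
    HierarchyEntry *parent;     // nullptr for the root of the hierarchy
    // Other data (pivot point, transformation, etc.) can be appended here.
};

// Each vertex records which hierarchy entry it belongs to, in the same way
// that a file records the folder in which it is stored.
struct SkinnedVertex {
    float x, y, z;
    HierarchyEntry *owner;
};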
Figure 5.3. A hierarchical data structure for character animation and articulated
figures.
5.1.3
Figure 5.4. Modeling the natural environment using a polygonal model and a fractal
description. The image on the left was generated using 120,000 polygons, while
the one on the right required storing only a few fractal parameters. (Rendering still
requires the scene to be polygonized, but there are no boundaries to the scene or level
of detail issues.)
³A fractal is a form of computer-generated art that creates complex, repetitive, mathematically based geometric shapes and patterns that resemble those found in nature.
5.2
Scene Graphs
Figure 5.5. Two scene graphs representing a scene with two cubes, two spheres and a
triangle.
Figure 5.6. Three graph topologies. (a) The tree has a definite parent-child relation-
ship. (b) The cyclic graph. (c) The acyclic graph, which has no paths leading back to
a node that has been visited before.
The scene-graph developer tools offer platform independence, hierarchical data storage, a selection of viewers and viewpoints (stereoscopic, full
screen, multiple cameras etc.), texturing, billboarding, instancing, selection,
picking, navigation by taking control input from a wide range of input devices
etc. They are usually constructed using object-oriented programming techniques on top of a hardware rendering library such as OpenGL or Direct3D.
There are fully supported commercial scene-graph tools, e.g., Mercury's Open Inventor [12], and enthusiastically supported open-source packages, e.g., Open
Inventor [22], OpenSG [19] and OpenSceneGraph [15]. Note that OpenSG
should not be confused with OpenSceneGraph. They are entirely different
scene-graph packages, even though they share similar names.
The open-source packages have the advantage that we can build them
from source for use on different platforms. However, the idea of being able
to extend them or add to them is rarely practical because they tend to be very large. And sadly, open-source packages often have limited documentation,
which can be very hard to follow. In these cases, one has to make do with minor modifications to the distributed examples, which is actually often
sufficient.
We will briefly explore two open-source packages that encapsulate the
OpenGL (see Chapter 13) 3D graphics rendering library: Open Inventor
and OpenSceneGraph. Since scene graphs are object oriented, it is natural
that both packages have been written in the C++ language. They provide a
number of C++ classes with which to construct the scene graph, arrange its
visualization and allow for the user to interact with it: everything we need
for VR. It is interesting to compare these tools, because they are essentially
doing the same thing, only using different terminology. Often some of the
#include <stdlib.h>
// Header file for the X windows platform and viewer
#include <Inventor/Xt/SoXt.h>
#include <Inventor/Xt/viewers/SoXtExaminerViewer.h>
// Header file for the nodes in the scene graph. Separator
// nodes divide the graph up into its components.
#include <Inventor/nodes/SoSeparator.h>
// Header for the lighting classes
#include <Inventor/nodes/SoDirectionalLight.h>
// Header for the materials classes
#include <Inventor/nodes/SoMaterial.h>
// Header for a cube object class
#include <Inventor/nodes/SoCube.h>
Listing 5.1. The structure of an Open Inventor program for drawing a primitive shape
hardest things to come to terms with when you are confronted with a software
development environment are the terms used to describe the actions of the
tools. In the context of the scene graph, these are just what constitutes a node
of the graph, how the scene graph is partitioned and in what order its nodes
are processed.
5.2.1
Open Inventor
Open Inventor (OI) [22] from Silicon Graphics (the inventors of
OpenGL) has the advantage of being a well-documented scene-graph system;
see the book by Wernecke [26]. The toolkit contains four principal tools:
1. A 3D scene database that allows mesh models, surface properties and
elements for user interaction to be arranged in a hierarchical structure.
2. Node kits that are used for bringing in pre-built node components.
3. Manipulator objects that are placed in the scene to allow the user to
interact with its objects. For example, a manipulator object will allow
the user to select and move other objects in the scene.
4. A component library for the X Window display environment that delivers the usual range of window behavior, including a display area,
an event handler (for mouse movement etc.), an execution loop that
repeats continuously while the program is running and different appearances for the window container, e.g., a simple viewer or an editor
window decorated with command buttons and menus.
A simple Open Inventor scene graph is illustrated in Figure 5.7, and the
key instructions for a template program can be found in Listing 5.1. OI also
offers a set of wrappers to allow equivalent coding in C. Nodes in the graph
may carry geometry information or material properties, or be a transformation, a light or camera etc. Alternatively, nodes could be group nodes, which
bring together other nodes or groups of nodes to form the hierarchy. Separator group nodes isolate the effect of transformations and material properties
into a part of the graph.
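Listing 5.1 shows only the header files of such a template program. As a hedged sketch of how the remainder typically looks, following the standard examiner-viewer pattern used in the Open Inventor examples (the material color and window title are arbitrary choices, not taken from any listing in this book), the body might be:

// Header files as in Listing 5.1.
#include <Inventor/Xt/SoXt.h>
#include <Inventor/Xt/viewers/SoXtExaminerViewer.h>
#include <Inventor/nodes/SoSeparator.h>
#include <Inventor/nodes/SoDirectionalLight.h>
#include <Inventor/nodes/SoMaterial.h>
#include <Inventor/nodes/SoCube.h>

int main(int argc, char **argv)
{
    (void)argc;
    // Initialize Inventor and the Xt component library; returns the shell widget.
    Widget window = SoXt::init(argv[0]);
    if (window == NULL) return 1;

    // Build a small scene graph: a separator grouping a light, a material and a cube.
    SoSeparator *root = new SoSeparator;
    root->ref();
    root->addChild(new SoDirectionalLight);
    SoMaterial *material = new SoMaterial;
    material->diffuseColor.setValue(0.8f, 0.2f, 0.2f);   // an arbitrary reddish surface
    root->addChild(material);
    root->addChild(new SoCube);

    // Hand the scene graph to an examiner viewer and enter the event loop.
    SoXtExaminerViewer *viewer = new SoXtExaminerViewer(window);
    viewer->setSceneGraph(root);
    viewer->setTitle("Cube");
    viewer->show();
    SoXt::show(window);
    SoXt::mainLoop();
    return 0;
}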
5.2.2
OpenSceneGraph
OpenSceneGraph (OSG) [15] is a vast and flexible package that covers many of the same ideas as Open Inventor. It too is composed of a number of tools that manifest themselves as C++ classes:
Figure 5.7. Open Inventor scene graphs allow groups to be either separated from or
Core classes. osg:: These represent the nodes in the graph, a grouping
of nodes, the objects in the scene (the drawables), the transformations
that change the appearance and behavior of the objects and many other
things such as the definition of light sources and levels of detail.
Data classes. osgDB:: These provide support for managing the plugin
modules, which are known as NodeKits, and also for managing loaders
which read 3D scene data.
Text classes. osgText:: These are NodeKits for rendering TrueType
fonts.
Particle classes. osgParticle:: This is a NodeKit for particle systems.
This is a particularly useful feature for VR environments requiring the
simulation of smoke, cloud and flowing water effects.
Plugin classes. osgPlugins:: A large collection for reading many
kinds of image file formats and 3D database formats.
Figure 5.8. An OpenSceneGraph scene graph shows the group nodes and their at-
tached drawable objects connected through the Geode class. Multiple copies of the
same object at different locations are obtained by applying different transformations
(called a position and attitude transformation (PAT)).
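As a rough illustration of the structure shown in Figure 5.8 (a sketch, not code from the OSG distribution), the fragment below hangs one drawable Geode beneath two PositionAttitudeTransform nodes so that the same geometry appears at two positions; the osgViewer component is assumed to be available for display.

#include <osg/Group>
#include <osg/Geode>
#include <osg/Shape>
#include <osg/ShapeDrawable>
#include <osg/PositionAttitudeTransform>
#include <osgViewer/Viewer>

int main()
{
    // A single drawable (a unit sphere) attached to a Geode.
    osg::ref_ptr<osg::Geode> geode = new osg::Geode;
    geode->addDrawable(new osg::ShapeDrawable(
        new osg::Sphere(osg::Vec3(0.0f, 0.0f, 0.0f), 1.0f)));

    // Two PATs give two instances of the same geometry at different positions.
    osg::ref_ptr<osg::PositionAttitudeTransform> pat1 = new osg::PositionAttitudeTransform;
    pat1->setPosition(osg::Vec3(-2.0f, 0.0f, 0.0f));
    pat1->addChild(geode.get());

    osg::ref_ptr<osg::PositionAttitudeTransform> pat2 = new osg::PositionAttitudeTransform;
    pat2->setPosition(osg::Vec3(2.0f, 0.0f, 0.0f));
    pat2->addChild(geode.get());

    // The root group node ties the whole scene graph together.
    osg::ref_ptr<osg::Group> root = new osg::Group;
    root->addChild(pat1.get());
    root->addChild(pat2.get());

    // Hand the graph to a viewer and run the event/render loop.
    osgViewer::Viewer viewer;
    viewer.setSceneData(root.get());
    return viewer.run();
}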
(Figure caption fragment: ... of navigation methods for traveling around the scene and options for windowed or full-screen viewing.)
5.3
In Section 5.1, we described a raw data format for 3D scenes and objects.
Variations on this format are used by all the well-known computer animation
and modeling packages. Even with such similar data organization, it can still
be a major undertaking to take the mesh models of an object constructed in
one package and use them in another package. Often quite separate conversion software has to be written or purchased. However, what happens if you
want to make your 3D content available on the Web? Indeed, why not make it interactive: a virtual world on the Web. This is happening already: realtors, art galleries, museums, even cities are offering Web browsers the ability to tour their homes for sale and look at their antiquities, online, interactively and in detailed 3D. All this is possible because a standard has evolved with textual descriptions of a virtual scene and a scripting language to facilitate user interaction, such as appearing to touch objects, triggering sound effects and moving about. The ideas from the Virtual Reality Modeling Language (VRML) have evolved into an Extensible Markup Language (XML) standard called X3D.
5.3.1
X3D/VRML/XML
(See Section 5.5.3.) In addition to providing a standard for high-quality video compression and streaming, it also allows a time-varying 2D or 3D world to be
represented, using a hierarchical scene description format based on
X3D/VRML. Objects within the scene may contain 2D still and moving
images, 2D and 3D graphics and sound. It supports user interaction, synchronization between objects and efficient compression of both objects and
scene description. A typical MPEG-4 data stream may resemble that shown
in Figure 5.9, where two video streams are mixed in with a 3D scene description. A user can interact with the objects in the scene by moving them around
with a pointing device.
Figure 5.9. An MPEG-4 data stream can contain 3D structural elements as well as
video (and audio). The video streams can be overlaid with the 3D objects moved
around the output window or even texture-mapped onto 3D objects. The MPEG-4
data format is flexible enough to act as a complete interactive multimedia container
for both streamed and downloaded source material.
X3D and its incarnation as XML format look a little different, but they are
essentially the same. An X3D encoding of the previous code fragment would
look like:
5.4
Listing 5.3. A VRML file describing a textured cube exported from OpenFX's [3]
modeler program.
Experts Group) format. Other formats that frequently crop up are the PNG
(Portable Network Graphics) and GIF (Graphics Interchange Format). Readily available software libraries that encode and decode these files into raw
24-bit pixel arrays can be easily integrated with any VR application. We show
how to do this in quite a few of the utilities and application programs described in Part II.
Nevertheless, it is useful to have an appreciation of the organization of
the file storage, so that it can be adapted to suit special circumstances, such as
stereoscopic imaging, which we discuss in Chapters 10 and 16.
5.4.1
coding mechanism (Huffman encoding) to produce the final format. After that, they are stored in a file structured to accommodate sufficient information for a decoding process to recover the
image, albeit with a little loss of information.
In making use of images encoded in JPEG format, we will need to know
how the compressed data is organized in the file. In fact, sometimes it is
actually harder to get details like this than details of the encoding algorithms
themselves. One thing we can say is that a lot of image file formats, and most
3D mesh description file formats, arrange their data in a series of chunks within the overall file. Each chunk usually has an important meaning, e.g., image size, vertex data, color table etc. In the case of a JPEG file, it conforms to the exchangeable image file format (Exif) [23] specification. This
allows extra information to be embedded into the JPEG file. Digital camera
manufacturers routinely embed information about the properties of their
cameras in what are called application markers (APPx). In Section 16.2,
we will see how an APP3 is used to embed stereoscopic information in a
JPEG file.
We will now briefly discuss the JPEG file layout. Hamilton [5] gives
an extensive outline of the file organization within the Exif framework. Table 5.1 and Figure 5.10 show the key chunk identifiers. Every marker in an
Exif file is identified with a double-byte header followed immediately by an unsigned 16-bit integer, stored in big-endian⁵ order, that gives the size of the chunk.
The first byte in any header is always 0xff,⁶ so the 16-bit header takes the form 0xff**. The first two bytes in an Exif file are always 0xffd8, representing the start of image (SOI) marker. Immediately following the opening bytes is the APP0 segment marker 0xffe0, followed by the segment length (always 16 bytes). As Figure 5.10 shows, the function of the APP0 marker is to give the dimensions of the image and, for example, whether the file also contains a thumbnail impression of the full-sized image.
⁵Storing integer numbers in binary form which exceed 255 in magnitude will require more than one byte. A 16-bit integer will have a high-order byte and a low-order byte. In writing these two bytes in a file, one has the option to store either the high-order one first or the low-order one first. When the high-order byte is written first, this is called big-endian order. When the low-order byte is written first, it is called little-endian order. Computers based on Intel processors and the Windows operating system use little-endian storage. Other processors, Sun and Apple systems, for example, use big-endian ordering for binary data storage.
⁶The 0x... notation, familiar to C programmers, specifies a byte value using the hexadecimal numbering system: two hex digits per 8 bits, e.g., 0xff = 255 (decimal). A 16-bit number requires four hex digits, e.g., 0x0100 = 256 (decimal).
Marker Value   Identifier   Function
0xd8           SOI          Start of image
0xe0           APP0         Application marker 0 (image dimensions)
0xe1           APP1         Application marker 1
0xe2           APP2         Application marker 2
0xe3           APP3         Application marker 3 (stereo image etc.)
0xef           APP15        Application marker 15
0xc0           SOF0         Image encoding is baseline standard
0xc4           DHT          Define the Huffman table
0xdb           DQT          Define the quantization table
0xda           SOS          Start of scan (the image data follows)
0xd9           EOI          End of image

Table 5.1. The most significant JPEG file segment markers for basic images using the Exif framework.
Two segments are key to recovering the actual image data. The DQT segment holds the quantization table containing the coefficients that were used to scale the frequency components after the DCT stage; a decoder uses these to recover the unscaled frequency components. The DHT segment stores the entries in the Huffman encoding table so that the decoder can recover the scaled frequency component values. The start of scan (SOS) segment contains the actual compressed image data from the 8 × 8 blocks of pixels. These are used together with the image size values from the APP0 header to rebuild the image for viewing.
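To illustrate this chunked layout, the following hedged sketch (not code from the libraries cited below) walks the segment markers of a JPEG/Exif file, printing each marker byte and its big-endian length until the start-of-scan segment is reached:

#include <cstdio>

// Walk the segment markers of a JPEG/Exif file. Each marker is 0xff followed
// by an identifying byte; most segments then carry a 16-bit big-endian length
// that includes the two length bytes themselves.
void listJpegMarkers(const char *filename)
{
    FILE *fp = std::fopen(filename, "rb");
    if (!fp) return;

    int c1 = std::fgetc(fp), c2 = std::fgetc(fp);
    if (c1 != 0xff || c2 != 0xd8) { std::fclose(fp); return; }   // not SOI
    std::printf("SOI (0xffd8)\n");

    for (;;) {
        int ff = std::fgetc(fp);
        int id = std::fgetc(fp);
        if (ff != 0xff || id == EOF) break;
        if (id == 0xd9) { std::printf("EOI (0xffd9)\n"); break; }

        // Read the 16-bit big-endian segment length (high-order byte first).
        int hi = std::fgetc(fp), lo = std::fgetc(fp);
        if (hi == EOF || lo == EOF) break;
        long length = (hi << 8) | lo;
        std::printf("marker 0xff%02x, length %ld bytes\n", id, length);

        if (id == 0xda) break;                 // SOS: entropy-coded data follows
        std::fseek(fp, length - 2, SEEK_CUR);  // skip the rest of this segment
    }
    std::fclose(fp);
}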
Computer codes to implement JPEG encoding and decoding are freely
available, and we have found those from [9] to work well in practice. They
can also be easily integrated into our VR application programs. Rich Geldreich [4] has also implemented a useful concise library based on a C++ class for
decoding only.
5.4.2
The most enduring lossless image storage format is undoubtedly the GIF format. It is ideal for small graphics containing text, such as those used on webpages. It has one major drawback, in that it requires a color palette and is only
able to reproduce 256 different colors in the same image. At the core of the
GIF algorithm is Lempel-Ziv-Welch (LZW) encoding, a derivative of the Lempel-Ziv compression algorithm [29]. Variants
Figure 5.10. The structure of a JPEG file showing the markers and expanded contents
of typical APP0, APP1, APP2 and APP3 chunks. More details of the chunks are
given in Table 5.1.
of this appear in compressed file archive systems such as ZIP and TGZ. The
PNG [20] format has been relatively successful for losslessly storing 24-bit color images. In a way similar to the Exif file organization, PNG files can
have application-specific information embedded with the picture data, and so
for example, they can be used to encode stereoscopic formatting information.
5.5
Storing Video
If the high compression ratios attainable with the JPEG algorithm are deemed necessary for single images, they become indispensable when working with
video. Digital video (DV), digital versatile disc (DVD), Moving Picture
Experts Group (MPEG-1/2/3/4...), Xvid and Microsoft's WMV all involve
video compression technologies. They do not, however, define how the data
should be stored, just how it should be compressed. For storage, the compressed video data will need to be inserted into a container file format of
some sort along with synchronized audio and possibly other multimedia content. Apples QuickTime [1, 7] was one of the first movie container formats
to support multiple tracks indexed by time and which was extensible so that
it can offer the inclusion of optional and efficient compression technology.
Typical of the container file types that have become synonymous with video
on the PC platform is Microsoft's AVI (audio video interleave) [14]. It is very
flexible, since the video and audio streams it contains may be compressed with
a wide selection of encoding methods.
5.5.1
AVI
AVI does not define any new method of storing data. It is an instance of Microsoft's resource interchange file format (RIFF), specifically containing synchronized audio and video data. The RIFF format is typical of many data
formats in that it is divided into chunks, which in turn are divided into subchunks. Within a chunk and sub-chunk, a header provides information about
the contents of the chunk. In the case of video data, the header will contain
such items as the width and height of the video frame, the number of data
streams in the chunk, the time between frames and the size of the chunk. For
AVI files, the sub-chunk header might indicate that it contains two streams,
for example one containing audio samples and the other video frames. Interleaving means that the stream data is arranged so that the picture data for
each video frame alternates with the data for a set of audio samples which
are to be played back during the time the video frame is onscreen. AVI files
allow for any compression method to be applied to the video samples and/or
the audio samples. An AVI file might, for example, contain video in uncompressed form, compressed by simple run-length encoding or the H.264/AVC
(advanced video coding) [21], which is used for broadcast-quality steaming
video with MPEG-4.
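To make the chunk and sub-chunk idea concrete, here is a hedged sketch (not part of any Microsoft API) that reads the top-level RIFF header of an AVI file and lists the four-character codes and sizes of the chunks that follow; RIFF sizes are stored as 32-bit little-endian integers:

#include <cstdio>
#include <cstdint>

// List the top-level chunks inside a RIFF/AVI container. Every chunk starts
// with a four-character code followed by a 32-bit little-endian size.
void listRiffChunks(const char *filename)
{
    FILE *fp = std::fopen(filename, "rb");
    if (!fp) return;

    char fourcc[5] = {0};
    std::uint32_t size = 0;

    // The file itself is one big "RIFF" chunk whose form type should be "AVI ".
    if (std::fread(fourcc, 1, 4, fp) != 4) { std::fclose(fp); return; }
    std::fread(&size, 4, 1, fp);                 // assumes a little-endian host
    char formType[5] = {0};
    std::fread(formType, 1, 4, fp);
    std::printf("%s chunk, %u bytes, form type %s\n", fourcc, (unsigned)size, formType);

    // Walk the chunks that follow (e.g., "LIST" and "idx1" in an AVI file).
    while (std::fread(fourcc, 1, 4, fp) == 4 && std::fread(&size, 4, 1, fp) == 1) {
        std::printf("  chunk %s, %u bytes\n", fourcc, (unsigned)size);
        long skip = (long)size + (size & 1);     // chunks are word (2-byte) aligned
        if (std::fseek(fp, skip, SEEK_CUR) != 0) break;
    }
    std::fclose(fp);
}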
5.5.2
Codecs
The accompanying compressed audio stream was enhanced in the early 1990s by adding
a system that became known as Musicam or Layer 2 audio. The Musicam format defined the
basis of the MPEG audio compression format: sampling rates, structure of frames, headers
and number of samples per frame. Its technologies and ideas were fully incorporated into the
definition of ISO MPEG Audio Layers 1, 2 and later 3, which when used without any video
stream gives rise to the familiar MP3 audio encoding format. Uncompressed audio as stored
on a compact disc has a bit rate of about 1400 kbit/s. MP3 as a lossy format is able to compress
this down to between 128 and 256 kbit/s.
out having to change things radically, it was possible to increase the resolution and compress broadcast-quality video, with a resolution of 720 × 480 for NTSC video and 720 × 576 for PAL video. The enhanced standard, MPEG-2, defines
the encoding and decoding methods for DVDs, hard-disk video recorders
and many other devices. It is so successful that the plans for an MPEG-3
especially for high-definition television were scrapped in favor of MPEG-4,
which can accommodate a much broader range of multimedia material within
its specification. However, for audio and video, the key concepts of how the encoding is done have remained basically the same since MPEG-1. We will look briefly at this now.
5.5.3
The success of the JPEG encoding method in compressing single digital images raises the question of whether the same high compression ratios can be
achieved with digital video. There is one drawback, in that the video decompressor must be able to work fast enough to keep up with the 25 or 30
frames per second required. The MPEG compression standard delivers this
by combining the techniques of transformation into the frequency domain
with quantization (as used in JPEG) and predictions based on past frames
and temporal interpolation. Using these techniques, a very high degree of
compression can be achieved. By defining a strategy for compression and
decompression rather than a particular algorithm to follow, the MPEG encoding leaves the door open for research and development to deliver higher
compression rates in the future.
Figure 5.11 illustrates the ideas that underlie the MPEG compression
concept. In Figure 5.11(a), each frame in the video stream is either compressed as-is using DCT and entropy encoding, or obtained by prediction
from an adjacent frame in the video stream. In Figure 5.11(b), a frame is
broken up into macro-blocks, and the movement of a macro-block is tracked
between frames. This gives rise to an offset vector. In Figure 5.11(c), a difference between the predicted position of the macro-block and its actual position will give rise to an error. In the decoder (Figure 5.11(d)), a P-frame is
reconstructed by applying the motion vector to a macro-block from the most
recent I-frame and then adding in the predicted errors.
An MPEG video stream is a sequence of three kinds of frames:
1. I-frames can be reconstructed without any reference to other frames.
2. P-frames are forward predicted from the previous I-frame or P-frame.
3. B-frames are both forward predicted and backward predicted from the
previous/next I- or P-frame. To allow for backward prediction, some
of the frames are sent out-of-order, and the decoder must have some
built-in buffering so that it can put them back into the correct order.
This accounts for part of the small delay which digital television transmissions exhibit. Figure 5.11(a) illustrates the relationship amongst the
frames.
Figure 5.11(b) shows how prediction works. If an I-frame shows a picture
of a moving triangle on a white background, in a later P-frame the same
triangle may have moved. A motion vector can be calculated that specifies
how to move the triangle from its position in the I-frame so as to obtain the
position of the triangle in the P-frame. This 2D motion vector is part of the
MPEG data stream. A motion vector isnt valid for the whole frame; instead
the frame is divided into small macro-blocks of 16 × 16 pixels. (A macro-block is actually made up from 4 luminance blocks and 2 chrominance blocks of size 8 × 8 each, for the same reason as in the JPEG image format.)
Every macro-block has its own motion vector. How these motion vectors are determined is a complex subject, and improvements in encoder algorithms have resulted in much higher compression ratios and faster compression speeds, e.g., in the Xvid [28] and DivX [2] codecs. However, as
Figure 5.11(c) shows, the rectangular shape (and the macro-block) is not restricted to translational motion alone. It could also rotate by a small amount at the same time. So a simple 2D translation will result in a prediction error. Thus the MPEG stream needs to contain data to correct for the error. In the decoder, a P-frame macro-block is recovered in two steps (a code sketch follows this list):
1. Apply the motion vector to the base frame.
2. Add the correction to compensate for prediction error. The prediction
error compensation usually requires fewer bytes to encode than the
whole macro block. DCT compression applied to the prediction error
also decreases its size.
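The sketch below is a much simplified, hedged illustration of that two-step reconstruction for a single 16 × 16 luminance macro-block; the frame layout, type names and absence of bounds checking are illustrative assumptions, not part of the MPEG specification.

#include <cstdint>

const int MB = 16;   // macro-block size in pixels

// A minimal luminance-only frame: width*height 8-bit samples in row-major order.
struct Frame {
    int width, height;
    std::uint8_t *pixels;
    std::uint8_t &at(int x, int y) { return pixels[y * width + x]; }
};

// Reconstruct one P-frame macro-block whose top-left corner is (mbx, mby):
// step 1, copy the macro-block from the reference (I- or P-) frame displaced
// by the motion vector (mvx, mvy); step 2, add the decoded prediction error.
void reconstructMacroBlock(Frame &dst, const Frame &ref,
                           int mbx, int mby, int mvx, int mvy,
                           const int error[MB][MB])
{
    for (int y = 0; y < MB; ++y) {
        for (int x = 0; x < MB; ++x) {
            int sx = mbx + x + mvx;          // source sample in the reference frame
            int sy = mby + y + mvy;
            int predicted = ref.pixels[sy * ref.width + sx];
            int value = predicted + error[y][x];
            if (value < 0) value = 0;        // clamp to the valid 8-bit range
            if (value > 255) value = 255;
            dst.at(mbx + x, mby + y) = static_cast<std::uint8_t>(value);
        }
    }
}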
Figure 5.11(d) illustrates how the predicted motion and prediction error
in a macro-block are recombined in a decoder. Not all predictions come
true, and so if the prediction error is too big, the coder can decide to insert
an I-frame macro-block instead. In any case, either the I-frame version of
the macro-block or the error information is encoded by performing a discrete
cosine transform (DCT), quantization and entropy encoding in a very similar
way to the way that it is done for still JPEG images.
Figure 5.11. The MPEG compression concept. (a) Types of compression applied
to successive images. (b) A frame is broken up into macro-blocks and their motion
tracked. (c) A position error in the predicted position of the macro-block is obtained.
(d) In the decoder, a P-frame is reconstructed.
A similar scheme with macro-blocks is applied to compression of the B-frames, but for B-frames, forward prediction or backward prediction or both may be used, so as to try to obtain a higher compression ratio, i.e., less error.
5.5.4
5.6
Summary
This chapter has looked at the most common data organizations for storing
3D specifications of virtual objects and scenes. We shall be using this information in Chapters 13 and 14, where we look at application programs to
render a virtual environment in real time and with a high degree of realism.
This chapter also explored the ideas that underlie image storage, first because
image-based texture mapping is so important for realism of virtual environments, and second because modifications to the well-known storage formats
are needed in Chapter 16 when we discuss software to support stereopsis. The
chapter concluded with a look at video and movie file formats, because this
impacts on how multimedia content from recorded and live video sources
might be used in VR.
Bibliography
[1] Apple Computer Inc. Inside Macintosh: QuickTime. Reading, MA: Addison-Wesley, 1993.
[2] DivX, Inc. DivX Video Codec. http://www.divx.com/.
[3] S. Ferguson. OpenFX. http://www.openfx.org/, 2006.
[4] R. Geldreich. Small JPEG Decoder Library. http://www.users.voicenet.com/
richgel/, 2000.
[5] E. Hamilton. JPEG File Interchange Format. Technical Report, C-Cube Microsystems, 1992.
[6] J. Hartman, J. Wernecke and Silicon Graphics, Inc. The VRML 2.0 Handbook.
Reading, MA: Addison-Wesley Professional, 1996.
[7] E. Hoffert et al. QuickTime: An Extensible Standard for Digital Multimedia.
In Proceedings of the Thirty-Seventh International Conference on COMPCON,
pp. 15–20. Los Alamitos, CA: IEEE Computer Society Press, 1992.
[8] M. Hosseni and N. Georganas. Suitability of MPEG-4's BIFS for Development of Collaborative Virtual Environments. In Proceedings of the 10th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises, pp. 299–304. Washington, D.C.: IEEE Computer Society, 2001.
[9] Independent JPEG Group. Free Library for JPEG Image Compression. http://
www.ijg.org/, 2001.
[10] T. Lane. Introduction to JPEG. http://www.faqs.org/faqs/compression-faq/
part2/section-6.html, 1999.
[11] T. Lane. JPEG Image Compression FAQ, part 1/2. http://www.faqs.org/faqs/
jpeg-faq//part1/, 1999.
[12] Mercury Computer Systems, Inc. Open Inventor Version 6.0 by Mercury Computer Systems. http://www.mc.com/products/view/index.cfm?id=
31&type=software.
[13] Moving Picture Experts Group. MPEG-4 Description. http://www.
chiariglione.org/mpeg/standards/mpeg-4/mpeg-4.htm, 2002.
[14] J. D. Murray and W. vanRyper. Encyclopedia of Graphics File Formats. Sebastopol, CA: O'Reilly Media, 1994.
[15] OSG Community. OpenSceneGraph. http://www.openscenegraph.org, 2007.
[16] H. Peitgen and D. Saupe (editors). The Science of Fractal Images. New York:
Springer-Verlag, 1988.
[17] H. Peitgen et al. Chaos and Fractals: New Frontiers in Science. New York:
Springer-Verlag, 1993.
[18] W. Pennebaker and J. Mitchell. JPEG Still Image Compression Standard. New
York: Van Nostrand Reinhold, 1993.
[19] D. Reiners and G. Voss. OpenSG. http://www.opensg.org, 2006.
[20] G. Roelofs. PNG (Portable Network Graphics). http://www.libpng.org/pub/
png/, 2007.
[21] R. Schafer, T. Wiegand and H. Schwarz. The Emerging H.264/AVC Standard. EBU Technical Review 293 (2003).
[22] Silicon Graphics, Inc. Open Inventor. http://oss.sgi.com/projects/inventor/,
2003.
[23] Technical Standardization Committee on AV & IT Storage Systems and Equipment. Exchangeable Image File Format for Digital Still Cameras: Exif Version
2.2. Technical Report, Japan Electronics and Information Technology Industries Association, http://www.exif.org/Exif2-2.PDF, 2002.
[24] G. Wallace. The JPEG Still Compression Standard. Communications of the
ACM 34:4 (1991) 31–44.
[25] T. Wegner and B. Tyler. Fractal Creations, Second Edition. Indianapolis, IN:
Waite Group, 1993.
[26] J. Wernecke and Open Inventor Architecture Group. The Inventor Mentor: Programming Object-Oriented 3D Graphics with Open Inventor, Release 2. Reading,
MA: Addison-Wesley Professional, 1994.
[27] X3D. http://www.web3d.org/x3d/, 2006.
[28] Xvid Codec. http://www.xvid.org/, 2006.
[29] J. Ziv and A. Lempel. A Universal Algorithm for Sequential Data Compression. IEEE Transactions on Information Theory 23:3 (1977) 337–343.
A Pocket 3D Theory Reference
For all intents and purposes, the world we live in is three dimensional. Therefore, if we want to construct a realistic computer model of it, the model
should be in three dimensions. Geometry is the foundation on which computer graphics, and specifically the 3D computer graphics utilized in VR,
is based. As we alluded to in Chapter 5, we can create a virtual world by storing the locations and properties of primitive shapes. Indeed, the production
of the photorealistic images necessary for VR is fundamentally determined by
the intersection of straight lines with these primitive shapes.
This chapter describes some useful mathematical concepts that form the
basis for most of the 3D geometry that is used in VR. It also establishes the
notation and conventions that will be used throughout the book. It will be
assumed that you have an appreciation of the Cartesian frame of reference
used to map three-dimensional space and that you know the rules for manipulating vectors and matrices; that is, performing vector and matrix arithmetic.
In any case, we review the fundamentals of 3D geometry, since it is essential
to have a general knowledge in this area when we discuss the more complex
topic of rendering, which we cover in Chapter 7. Therefore, in this chapter
we explain how to manipulate the numerical description of a virtual world
so that we will be able to visualize it with views from any location or angle.
Then, in the next chapter, we explain how this jumble of 3D mathematics
leads to the beautiful and realistic images that envelop you in a virtual world.
6.1
Coordinate Systems
6.1.1
Cartesian
Figure 6.1 illustrates the Cartesian system. Any point P is uniquely specified
by a triple of numbers (a, b, c). Mutually perpendicular coordinate axes are
conventionally labeled x, y and z. For the point P, the numbers a, b and c
can be thought of as distances we need to move in order to travel from the
origin to the point P (move a units along the x-axis then b units parallel to
the y-axis and finally c units parallel to the z-axis).
In the Cartesian system, the axes can be orientated in either a left- or right-handed sense. A right-handed convention is consistent with the vector cross
product, and as such all algorithms and formulae used in this book assume a
right-handed convention.
Figure 6.1. Right-handed and left-handed coordinate systems with the z-axis vertical.
6.1.2
Spherical Polar
Figure 6.2 shows the conventional spherical polar coordinate system in relation to the Cartesian axes. The distance r is a measure of the distance from the origin to a point in space. The angles θ and φ are taken relative to the z- and x-axes respectively. Unlike the Cartesian x-, y- and z-values, which all take the same units, spherical polar coordinates use both distance and angle measures. Importantly, there are some points in space that do not have a unique one-to-one relationship with an (r, θ, φ) coordinate value. For example, points lying on the positive z-axis can have any value of φ. That is, (100, 0, 0) and (100, 0, π) both represent the same point.
Also, the range of values which (r, θ, φ) can take is limited. The radial distance r is always positive, 0 ≤ r < ∞; θ lies in the range 0 ≤ θ ≤ π; and φ takes values −π < φ ≤ π.
It is quite straightforward to change from one coordinate system to the other. When the point P in Figure 6.2 is expressed as (r, θ, φ), the Cartesian coordinates (x, y, z) are given by the trigonometric expressions

    x = r sin θ cos φ,
    y = r sin θ sin φ,
    z = r cos θ.

Conversion from Cartesian to spherical coordinates is a little more tricky. The radial distance is easily determined by r = √(x² + y² + z²), and θ follows by trigonometry: θ = arccos(z/r). Finally, we can then determine φ, again by trigonometry:
    φ = arctan(y/x)          if x > 0,
    φ = arctan(y/x) + π      if x < 0 and y ≥ 0,
    φ = arctan(y/x) − π      if x < 0 and y < 0,
    φ = (π/2) sign(y)        otherwise.
If performing these calculations on a computer, an algorithm is required that tests for the special cases where P lies very close to the z-axis, that is, when x and y both approach zero. A suitable implementation is presented in Figure 6.3.
if (x² + y²) < ε {
    r = z
    θ = 0
    φ = 0
}
else {
    r = √(x² + y² + z²)
    θ = arccos(z/r)
    φ = ATAN2(y, x)
}
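A direct C++ transcription of this pseudocode might look as follows; the Spherical struct and the value chosen for ε are illustrative:

#include <cmath>

struct Spherical { double r, theta, phi; };

// Convert Cartesian (x, y, z) to spherical polar (r, theta, phi), guarding
// against points that lie very close to the z-axis.
Spherical cartesianToSpherical(double x, double y, double z)
{
    const double epsilon = 1.0e-12;   // of the order of machine precision
    Spherical s;
    if (x * x + y * y < epsilon) {
        s.r = z;
        s.theta = 0.0;
        s.phi = 0.0;
    } else {
        s.r = std::sqrt(x * x + y * y + z * z);
        s.theta = std::acos(z / s.r);
        s.phi = std::atan2(y, x);     // the library atan2 handles all quadrants
    }
    return s;
}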
6.2
Vectors
The vector, the key to all 3D work, is a triple of real numbers (in most computer languages, these are usually called floating-point numbers) and is noted
in a bold typeface, e.g., P or p. When hand-written (and in the figures of this
book), vectors are noted with an underscore, e.g., P.
Care must be taken to differentiate between two types of vector (see Figure 6.4).
Position vectors. A position vector runs from the origin of coordinates (0, 0, 0) to a point (x, y, z), and its length gives the distance of the
point from the origin. Its components are given by (x, y, z). The essential
concept to understand about a position vector is that it is anchored to specific
coordinates (points in space). The set of points or vertices that are used to
describe the shape of all models in 3D graphics can be thought of as position
vectors.
Thus a point with coordinates (x, y, z) can also be identified as the
end point of a position vector p. We shall often refer to a point as (x, y, z)
or p.
Direction vectors. A direction vector differs from a position vector in
that it is not anchored to specific coordinates. Frequently direction vectors
are used in a form where they have unit length; in this case they are said to
be normalized. The most common application of a direction vector in 3D
computer graphics is to specify the orientation of a surface or direction of a
ray of light. For this we use a direction vector at right angles (normal) and
pointing away from a surface. Such normal vectors are the key to calculating
surface shading effects. It is these shading effects which enhance the realism
of any computer generated image.
6.3
The Line
There are two useful ways to express the equation of a line in vector form (see
Figure 6.5). For a line passing through a point P0 and having a direction d,
any point p which lies on the line is given by

    p = P0 + μ d̂,

where P0 is a position vector, d̂ is a unit-length (normalized) direction vector and μ is a scalar parameter.
Alternatively, any point p on a line passing through two points P0 and P1 is given by

    p = P0 + μ(P1 − P0).
Using two points to express an equation for the line is useful when we
need to consider a finite segment of a line. (There are many examples where
we need to use segments of lines, such as calculating the point of intersection
between a line segment and a plane.)
Thus, if we need to consider a line segment, we can assign P0 and P1 to the segment end points, with the consequence that any point p on the line will only be part of the segment if its value of μ in the equation above lies in the interval [0, 1]. That is, when μ = 0, p = P0, and when μ = 1, p = P1.
6.4
The Plane
A plane can be specified by three points P0, P1 and P2 that lie in it. Any point p in the plane then satisfies

    (p − P0) · [ (P2 − P0) × (P1 − P0) ] / | (P2 − P0) × (P1 − P0) | = 0,

where the normalized cross product is the unit normal n̂ to the plane.
6.4.1
The intersection of a line and a plane will occur at a point pI that satisfies the equation of the line and the equation of the plane simultaneously. For the line p = Pl + μ d̂ and the plane (p − Pp) · n̂ = 0, the point of intersection pI is given by the algorithm shown in Figure 6.7.
if |d̂ · n̂| < ε { no intersection }
else {
    μ = ((Pp − Pl) · n̂) / (d̂ · n̂)
    pI = Pl + μ d̂
}
Figure 6.7. Algorithm to determine whether a line and a plane intersect.
Note that we must first test to see whether the line and plane actually intersect. That is, if the dot product of the direction vector of the line and the normal vector to the plane is zero, no intersection will occur, because the line is then perpendicular to the plane's normal and therefore parallel to the plane itself. The parameter ε allows for the numerical accuracy of computer calculations, and since d̂ and n̂ are of unit length, ε should be of the order of the machine arithmetic precision.
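A hedged C++ sketch of the algorithm of Figure 6.7, using a small Vec3 helper type defined here purely for illustration:

#include <cmath>

struct Vec3 { double x, y, z; };

double dot(const Vec3 &a, const Vec3 &b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
Vec3 sub(const Vec3 &a, const Vec3 &b)   { return Vec3{a.x-b.x, a.y-b.y, a.z-b.z}; }
Vec3 add(const Vec3 &a, const Vec3 &b)   { return Vec3{a.x+b.x, a.y+b.y, a.z+b.z}; }
Vec3 scale(const Vec3 &a, double s)      { return Vec3{a.x*s, a.y*s, a.z*s}; }

// Intersect the line p = Pl + mu*d (d of unit length) with the plane through
// Pp with unit normal n. Returns false when the line runs parallel to the plane.
bool intersectLinePlane(const Vec3 &Pl, const Vec3 &d,
                        const Vec3 &Pp, const Vec3 &n,
                        Vec3 &pI)
{
    const double eps = 1.0e-12;                 // of the order of machine precision
    double denom = dot(d, n);
    if (std::fabs(denom) < eps) return false;   // no (unique) intersection
    double mu = dot(sub(Pp, Pl), n) / denom;
    pI = add(Pl, scale(d, mu));
    return true;
}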
a = P0 − Pp
b = P1 − Pp
da = a · n̂
db = b · n̂
if |da| ≈ 0 and |db| ≈ 0 {
    both P0 and P1 lie in the plane
}
else {
    dab = da db
    if dab < 0 {
        The line crosses the plane
    }
    else {
        The line does not cross the plane
    }
}
Figure 6.8. Algorithm to determine whether the line joining two points crosses a
plane.
6.4.2
6.4.3
Many 3D algorithms (rendering and polygonal modeling) require the calculation of the point of intersection between a line and a fundamental shape
called a primitive. We have already dealt with the calculation of the intersection between a line and a plane. If the plane is bounded, we get a primitive
shape called a planar polygon. Most 3D rendering and modeling application
programs use polygons that have either three sides (triangles) or four sides
(quadrilaterals). Triangular polygons are by far the most common, because it
is always possible to reduce an n-sided polygon to a set of triangles.
This calculation is used time and time again in rendering algorithms (image and texture mapping and rasterization). The importance of determining
whether an intersection occurs in the interior of the polygon, near one of
its vertices, at one of its edges or indeed just squeaks by outside cannot be
overemphasized. We deal with the implications of these intersections in the
next chapter.
This section gives an algorithm that can be used to determine whether a
line intersects a triangular polygon. It also classifies the point of intersection
as being internal, a point close to one of the vertices or within some small
distance from an edge.
The geometry of the problem is shown in Figure 6.9. The point Pi gives
the intersection between a line and a plane. The plane is defined by the
points P0, P1 and P2, as described in Section 6.4.1. These points also identify
the vertices of the triangular polygon. Vectors u and v are along two of the
edges of the triangle under consideration. Provided u and v are not collinear,
the vector w = Pi − P0 lies in the plane and can be expressed as a linear combination of u and v:

    w = α u + β v.    (6.1)
Once α and β have been calculated, a set of tests will reveal whether Pi lies inside or outside the triangular polygon with vertices at P0, P1 and P2.
Algorithm overview:

To find Pi, we use the method described in Section 6.4.1.

To calculate α and β, take the dot product of Equation (6.1) with u and v respectively. After a little algebra, the results can be expressed as

    α = ((w · u)(v · v) − (w · v)(u · v)) / ((u · u)(v · v) − (u · v)²),

    β = ((w · v)(u · u) − (w · u)(u · v)) / ((u · u)(v · v) − (u · v)²).
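The tests themselves follow from the geometry: a common form (sketched below with an illustrative Vec3 type, and assuming the intersection point has already been found as in Section 6.4.1) accepts Pi as interior when α ≥ 0, β ≥ 0 and α + β ≤ 1.

#include <cmath>

struct Vec3 { double x, y, z; };

double dot(const Vec3 &a, const Vec3 &b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
Vec3 sub(const Vec3 &a, const Vec3 &b)   { return Vec3{a.x-b.x, a.y-b.y, a.z-b.z}; }

// Given the intersection point Pi of the line with the plane of the triangle
// (P0, P1, P2), compute alpha and beta from Equation (6.1) and test whether
// Pi falls inside the triangle.
bool pointInTriangle(const Vec3 &Pi,
                     const Vec3 &P0, const Vec3 &P1, const Vec3 &P2)
{
    Vec3 u = sub(P1, P0);
    Vec3 v = sub(P2, P0);
    Vec3 w = sub(Pi, P0);

    double uu = dot(u, u), vv = dot(v, v), uv = dot(u, v);
    double wu = dot(w, u), wv = dot(w, v);
    double denom = uu * vv - uv * uv;          // zero only if u and v are collinear
    if (std::fabs(denom) < 1.0e-12) return false;

    double alpha = (wu * vv - wv * uv) / denom;
    double beta  = (wv * uu - wu * uv) / denom;

    // Interior (or on the boundary) when both coefficients are non-negative
    // and their sum does not exceed one.
    return alpha >= 0.0 && beta >= 0.0 && (alpha + beta) <= 1.0;
}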
6.5
Reflection in a Plane
In many visualization algorithms, there is a requirement to calculate a reflected direction given an incident direction and a plane of reflection. The
vectors we need to consider in this calculation are of the direction type and
assumed to be of unit length.
If the incident vector is d̂in, the reflection vector is d̂out and the surface normal is n̂, we can calculate the reflected vector by recognizing that, because the incident vector, the reflected vector and the surface normal all lie in the same plane, d̂out can be written as the incident vector plus some multiple of the normal:

    d̂out = d̂in + λ n̂,    (6.2)

where λ is a scalar factor. As the incident and reflected angles are equal,

    d̂in · n̂ = −d̂out · n̂.    (6.3)

Taking the dot product of Equation (6.2) with n̂ gives

    d̂out · n̂ = d̂in · n̂ + λ (n̂ · n̂).    (6.4)

We then substitute this expression for d̂out · n̂ into Equation (6.3) to give

    (d̂in + λ n̂) · n̂ = −d̂in · n̂.    (6.5)

Since n̂ is normalized, n̂ · n̂ = 1, thus λ = −2(d̂in · n̂) and so, after substituting for λ in Equation (6.2), we obtain the reflected direction:

    d̂out = d̂in − 2(d̂in · n̂) n̂.
This is the standard equation used to compute the direction of the reflected vector, and is very useful when studying lighting effects within the
virtual world.
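In code this is almost a one-liner; the sketch below uses an illustrative Vec3 type and assumes that both the incident direction and the surface normal have already been normalized:

struct Vec3 { double x, y, z; };

double dot(const Vec3 &a, const Vec3 &b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Reflect the (unit-length) incident direction din about the (unit-length)
// surface normal n: dout = din - 2 (din . n) n.
Vec3 reflect(const Vec3 &din, const Vec3 &n)
{
    double k = 2.0 * dot(din, n);
    return Vec3{din.x - k * n.x, din.y - k * n.y, din.z - k * n.z};
}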
6.6
Transformations
with matrices that have no equivalent for real numbers, such as writing their
transpose. Even the familiar 3D dot and cross products of two vectors can be
defined in terms of matrix multiplication.
Several different notations are used to represent a matrix. Throughout
this text, when we want to show the elements of a matrix, we will group
them inside [ ] brackets. Our earlier work used a bold typeface to represent
vectors, and we will continue to use this notation with single-column matrices
(vectors). A single-row matrix is just the transpose of a vector,2 and when we
need to write this we will use pT . When we want to represent a rectangular
matrix as an individual entity, we will use the notation of an italicized capital
letter, e.g., A.
It turns out that all the transformations appropriate for computer
graphics work (moving, rotating, scaling etc.) can be represented by
a square matrix of size 4 × 4.
If a transformation is represented by a matrix T, a point p is transformed to a new point p′ by matrix multiplication according to the rule

    p′ = T p.
The order in which the matrices are multiplied is important. For matrices
T and S, the product TS is different from the product ST ; indeed, one of
these may not even be defined.
There are two important points which are particularly relevant when using matrix transformations in 3D graphics applications:
1. How to multiply matrices of different dimensions.
2. The importance of the order in which matrices are multiplied.
The second point will be dealt with in Section 6.6.4. As for the first
point, to multiply two matrices, the number of columns in the first must
equal the number of rows in the second. For example, a matrix of size 3 × 3 and a matrix of size 3 × 1 may be multiplied to give a matrix of size 3 × 1. However, a 4 × 4 and a 3 × 1 matrix cannot be multiplied. This poses a small problem for us, because vectors are represented by 3 × 1 matrices and transformations are represented as 4 × 4 matrices.
²When explicitly writing the elements of a vector in matrix form, many texts conserve space by writing it as its transpose, e.g., [x, y, z]T, which of course is the same thing as laying it out as a column.
The problem is overcome by adding a fourth component, always set to 1, to each position vector, so that p is written as the column matrix

    p = [p0, p1, p2, 1]T.
Note: The fourth component will play a very important role when we discuss
the fundamentals of computer vision in Chapter 8.
The transformation of a point from p to p′ by the matrix T can be written as

    [p′0]   [t00 t01 t02 t03] [p0]
    [p′1] = [t10 t11 t12 t13] [p1]
    [p′2]   [t20 t21 t22 t23] [p2]
    [ 1 ]   [  0   0   0   1] [ 1]
6.6.1
Translation
    Tt = [1  0  0  dx]
         [0  1  0  dy]
         [0  0  1  dz]
         [0  0  0   1].
This transformation will allow us to move the point with coordinates (x, y, z)
to the point with coordinates (x + dx, y + dy, z + dz). That is, the translated
point becomes p′ = Tt p, or

    [x + dx]   [1  0  0  dx] [x]
    [y + dy] = [0  1  0  dy] [y]
    [z + dz]   [0  0  1  dz] [z]
    [   1  ]   [0  0  0   1] [1].
6.6.2
Scaling
    Ts = [sx  0   0   0]
         [ 0  sy  0   0]
         [ 0  0   sz  0]
         [ 0  0   0   1].
Essentially this allows us to scale (expand or contract) a position vector p
with components (x, y, z) by the factors sx along the x-axis, sy along the y-axis and sz along the z-axis. Thus, for example, if we scaled the three vertex positions of a polygon, we could effectively change its size, i.e., make it larger or smaller. The scaled point p′ is given by p′ = Ts p, or

    [x sx]   [sx  0   0   0] [x]
    [y sy] = [ 0  sy  0   0] [y]
    [z sz]   [ 0  0   sz  0] [z]
    [  1 ]   [ 0  0   0   1] [1].
6.6.3 Rotation

In two dimensions, there is only one way to rotate, and it is around a point. In three dimensions, we rotate around axes instead, and there are three of them to consider, each with its own formula. Thus a rotation is specified by both an axis of rotation and the angle of rotation. It is a fairly simple trigonometric calculation to obtain a transformation matrix for a rotation about one of the coordinate axes. When the rotation is to be performed around an arbitrary vector based at a given point, the transformation matrix must be assembled from a combination of rotations about the Cartesian coordinate axes and possibly a translation. This is outlined in more detail in Section 6.6.5.
Figure 6.11. Rotations, anticlockwise looking along the axis of rotation, towards the
origin.
$$T_z(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta & 0 & 0 \\ \sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}. \qquad (6.6)$$
We can see how the rotational transformations are obtained by considering a positive (anticlockwise) rotation of a point P by θ round the z-axis (which points out of the page). Before rotation, P lies at a distance l from the origin and at an angle φ to the x-axis (see Figure 6.12). The (x, y)-coordinate of P is (l cos φ, l sin φ). After rotation by θ, P is moved to P′ and its coordinates are (l cos(φ + θ), l sin(φ + θ)). Expanding the trigonometric sum gives expressions for the coordinates of P′:

P′x = l cos φ cos θ − l sin φ sin θ,
P′y = l cos φ sin θ + l sin φ cos θ.

Since l cos φ is the x-coordinate of P and l sin φ is the y-coordinate of P, the coordinates of P′ become
P′x = Px cos θ − Py sin θ,
P′y = Px sin θ + Py cos θ.

Writing this in matrix form, we have

$$\begin{bmatrix} P'_x \\ P'_y \end{bmatrix} =
\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} P_x \\ P_y \end{bmatrix},$$

or, in full homogeneous form,

$$\begin{bmatrix} P'_x \\ P'_y \\ P'_z \\ 1 \end{bmatrix} =
\begin{bmatrix} \cos\theta & -\sin\theta & 0 & 0 \\ \sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} P_x \\ P_y \\ P_z \\ 1 \end{bmatrix}.$$
The corresponding matrix for rotation about the y-axis is

$$T_y(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta & 0 \\ 0 & 1 & 0 & 0 \\ -\sin\theta & 0 & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix},$$
and for rotation about the x-axis,

$$T_x(\theta) = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta & 0 \\ 0 & \sin\theta & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$
Note that as illustrated in Figure 6.11, θ is positive if the rotation takes place in a clockwise sense when looking from the origin along the axis of rotation. This is consistent with a right-handed coordinate system.
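As an illustration of how the three rotation matrices can be built in code, here is a short C++ sketch (ours, assuming the right-handed convention above and the same row-major Mat4 layout as in the earlier sketch):

#include <array>
#include <cmath>

using Mat4 = std::array<std::array<double, 4>, 4>;

// 4 x 4 identity matrix, the starting point for every transformation.
Mat4 identity() {
    Mat4 m{};
    for (int i = 0; i < 4; ++i) m[i][i] = 1.0;
    return m;
}

// Rotation about the z-axis by theta (radians), as in Equation (6.6).
Mat4 rotateZ(double theta) {
    Mat4 m = identity();
    m[0][0] = std::cos(theta); m[0][1] = -std::sin(theta);
    m[1][0] = std::sin(theta); m[1][1] =  std::cos(theta);
    return m;
}

// Rotation about the y-axis by theta.
Mat4 rotateY(double theta) {
    Mat4 m = identity();
    m[0][0] =  std::cos(theta); m[0][2] = std::sin(theta);
    m[2][0] = -std::sin(theta); m[2][2] = std::cos(theta);
    return m;
}

// Rotation about the x-axis by theta.
Mat4 rotateX(double theta) {
    Mat4 m = identity();
    m[1][1] = std::cos(theta); m[1][2] = -std::sin(theta);
    m[2][1] = std::sin(theta); m[2][2] =  std::cos(theta);
    return m;
}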
6.6.4 Combining Transformations
Section 6.6 introduced the key concept of a transformation applied to a position vector. In many cases, we are interested in what happens when several
operations are applied in sequence to a model or one of its points (vertices).
For example, move the point P 10 units forward, rotate it 20 degrees round the
z-axis and shift it 15 units along the x-axis. Each transformation is represented
by a single 4 × 4 matrix, and the compound transformation is constructed as
a sequence of single transformations as follows:
p′ = T1 p;
p″ = T2 p′;
p‴ = T3 p″,

where p′ and p″ are intermediate position vectors and p‴ is the final position vector after the application of the three transformations. The above sequence can be combined into

p‴ = T3 T2 T1 p.
The product of the transformations T3 T2 T1 gives a single matrix T. Combining transformations in this way has a wonderful efficiency: if a large model has 50,000 vertices and we need to apply 10 transformations, by combining the 10 matrices into a single matrix T, each vertex need only be multiplied by one matrix rather than ten.
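A minimal C++ sketch of the matrix product used to combine transformations follows (again our own illustration; note that the right-most matrix in the product is the first transformation applied):

#include <array>

using Mat4 = std::array<std::array<double, 4>, 4>;

// Matrix product C = A B. Applying B first and then A to a point
// corresponds to the single matrix A * B.
Mat4 multiply(const Mat4& A, const Mat4& B) {
    Mat4 C{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

// Compound transformation T = T3 T2 T1: T1 acts on a point first.
// Building T once means every vertex needs only one matrix-vector
// multiplication, however many steps T contains:
//     Mat4 T = multiply(T3, multiply(T2, T1));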
6.6.5 Rotation about an Arbitrary Axis
The transformation corresponding to rotation by an angle θ around an arbitrary vector (for example, that shown between the two points P0 and P1 in Figure 6.14) cannot readily be written in a form similar to the rotation matrices about the coordinate axes.
The desired transformation matrix is obtained by combining a sequence of basic translation and rotation matrices; the full procedure is set out in Figure 6.15. (Once a single 4 × 4 matrix has been assembled in this way, it can be applied to points just like any other transformation.)
6.6.6 Viewing Transformation
Before rendering any view of a 3D scene, one has to decide from where to
view/photograph the scene and in which direction to look (point the camera). This is like setting up a camera to take a picture. Once the camera is
set, we just click the shutter. The camera projects the image as seen in the
viewfinder onto the photographic film and the image is rendered. Likewise,
when rendering a 3D scene, we must set up a viewpoint and a direction of
view. Then we must set up the projection (i.e., determine what we can and
cannot see), which is discussed in Section 6.6.7.
In mathematical terms, we need to construct a suitable transformation that will allow us to choose a viewpoint (place to set up the camera) and a direction of view (direction in which to point the camera).
Let d = P1 − P0
T1 = a translation by −P0
dxy = dx² + dy²
if dxy < ε { the rotation axis lies in the z-direction
    if dz > 0 make T2 a rotation about z by θ
    else T2 = a rotation about z by −θ
    T3 = a translation by P0
    return the product T3 T2 T1
}
dxy = √dxy
if dx = 0 and dy > 0, φ = π/2
else if dx = 0 and dy < 0, φ = −π/2
else φ = ATAN2(dy, dx)
α = ATAN2(dz, dxy)
T2 = a rotation about z by −φ
T3 = a rotation about y by α
T4 = a rotation about x by θ
T5 = a rotation about y by −α
T6 = a rotation about z by φ
T7 = a translation by P0
Multiply the transformation matrices to give the final result
T = T7 T6 T5 T4 T3 T2 T1

Figure 6.15. Algorithm for rotation round an arbitrary axis; dx, dy and dz are the components of the vector d = P1 − P0.
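A C++ sketch of the algorithm in Figure 6.15 is given below. It is our own rendering of the steps, reusing the Mat4 helpers from the earlier sketches (declared here but defined previously) and the sign conventions adopted above, so treat it as illustrative rather than as the book's listing:

#include <array>
#include <cmath>

using Mat4 = std::array<std::array<double, 4>, 4>;
struct Point3 { double x, y, z; };

// Helpers defined in the earlier sketches of this chapter.
Mat4 identity();
Mat4 rotateX(double theta);
Mat4 rotateY(double theta);
Mat4 rotateZ(double theta);
Mat4 multiply(const Mat4& A, const Mat4& B);

Mat4 translate(double dx, double dy, double dz) {
    Mat4 m = identity();
    m[0][3] = dx; m[1][3] = dy; m[2][3] = dz;
    return m;
}

// Rotation by angle theta about the axis through P0 and P1 (Figure 6.15).
Mat4 rotateAboutAxis(Point3 P0, Point3 P1, double theta) {
    const double eps = 1.0e-12;
    double dx = P1.x - P0.x, dy = P1.y - P0.y, dz = P1.z - P0.z;

    Mat4 T = translate(-P0.x, -P0.y, -P0.z);      // T1: move P0 to the origin
    double dxy = dx * dx + dy * dy;
    if (dxy < eps) {                              // axis already along +/- z
        T = multiply(rotateZ(dz > 0.0 ? theta : -theta), T);
        return multiply(translate(P0.x, P0.y, P0.z), T);
    }
    dxy = std::sqrt(dxy);
    double phi   = std::atan2(dy, dx);            // heading of the axis
    double alpha = std::atan2(dz, dxy);           // pitch of the axis

    T = multiply(rotateZ(-phi),   T);             // T2: swing axis into the xz-plane
    T = multiply(rotateY(alpha),  T);             // T3: drop axis onto the x-axis
    T = multiply(rotateX(theta),  T);             // T4: the rotation itself
    T = multiply(rotateY(-alpha), T);             // T5, T6, T7: undo the alignment
    T = multiply(rotateZ(phi),    T);
    T = multiply(translate(P0.x, P0.y, P0.z), T);
    return T;
}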
Once we have this view transformation, it can be combined with any other transformations that need to be applied to the scene, or to objects in the scene.
We have already seen how to construct transformation matrices that move
or rotate points in a scene. In the same way that basic transformation matrices
were combined in Section 6.6.5 to create an arbitrary rotation, we can build
a single matrix, To , that will transform all the points (vertices) in a scene in
such a way that the projection of an image becomes a simple standard process.
But here is the funny thing. In computer graphics, we do not actually change
the position of the camera; we actually transform or change the position of
all the vertices making up the scene so that the camera is fixed at the center of the
universe (0, 0, 0) and locked off to point in the direction (1, 0, 0) (along the
x-axis).
There is nothing special about the direction (1, 0, 0). We could equally
well have chosen to fix the camera to look in the direction (0, 1, 0), the y-axis, or even (0, 0, 1), the z-axis. But we have chosen to let z represent the
up direction, and it is not a good idea to look directly up, so (0, 0, 1) would
be a poor choice for viewing. (Note: The OpenGL and Direct3D software
libraries for 3D graphics have their z-axis parallel to the viewing direction.)
Once To has been determined, it is applied to all objects in the scene. If
necessary, To can be combined with other transformation matrices to give a
single composite transformation T .
6.6.7 Projection Transformation
After we set up a camera to record an image, the view must then be projected
onto film or an electronic sensing device. In the conventional camera, this is
done with a lens arrangement or simply a pinhole. One could also imagine
holding a sheet of glass in front of the viewer and then having her trace on
it what she sees as she looks through it. What is drawn on the glass is what
we would like the computer to produce: a 2D picture of the scene. It's even shown the right way up, as in Figure 6.17.
Figure 6.17. Project the scene onto the viewing plane. The resulting two-dimensional image is the picture we want the computer to produce.
It is straightforward to formulate expressions needed to perform this (nonlinear) transformation. A little thought must be given to dealing with cases
where parts of the scene go behind the viewer or are partly in and partly out
of the field of view. The field of view (illustrated in Figure 6.18) governs how
much of the scene you see. It can be changed so that you are able to see more
or less of the scene. In photography, telephoto and fish-eye lenses have different fields of view. For example, the common 50 mm lens has a field of view
of 45.9°. Because of its shape as a truncated pyramid with a regular base, the
volume enclosed by the field of view is known as a frustum.
One thing we can do with a projective transformation is adjust the aspect
ratio. The aspect ratio is the ratio of width to height of the rendered image.
It is 4 : 3 for television work and 16 : 9 for basic cine film. The aspect ratio
is related to the vertical and horizontal resolution of the recorded image. Get
this relationship wrong and your spheres will look egg shaped.
Before formulating expressions to represent the projection, we need to define the coordinate system in use for the projection plane. It has become almost universal³ to represent the computer display as a pair of integer pixel coordinates.

³ However, there are important exceptions, e.g., the OpenGL 3D library, where floating-point numbers are used.
Figure 6.18. The field of view governs how much of the scene is visible to a camera
located at the viewpoint. Narrowing the field of view is equivalent to using a zoom
lens.
The numerical factor 21.2 is a constant to allow us to specify ff in standard mm units. For a camera lens of focal length ff, the field of view can be expressed as

2 ATAN2(21.2, ff).
Any point (x, y, z) for which x < 1 will not be transformed correctly by
Equations (6.7) and (6.8). The reason for this is that we have placed the
projection plane at x = 1 and therefore these points will be behind the projection plane. As such, steps must be taken to eliminate these points before
the projection is applied. This process is called clipping. We talk more about
clipping in the next chapter.
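Although Equations (6.7) and (6.8) are not repeated here, a minimal sketch of the projection step might look like the following C++ fragment. It assumes (our choice, for illustration only) that the camera looks along the x-axis with z as the up direction, that the projection plane is at x = 1, and that points with x < 1 have already been clipped away:

#include <cmath>

struct Point3 { double x, y, z; };
struct Pixel  { int X, Y; };

// Project a point onto the viewing plane at x = 1 and map it to integer
// pixel coordinates. fov is the full horizontal field of view in radians.
Pixel projectToScreen(Point3 p, int width, int height, double fov) {
    double scale = (width / 2.0) / std::tan(fov / 2.0); // pixels per unit on the plane
    double cx = width / 2.0, cy = height / 2.0;
    double X = cx - scale * (p.y / p.x);   // y maps to the horizontal screen axis
    double Y = cy - scale * (p.z / p.x);   // z (up) maps to the vertical screen axis
    return Pixel{ static_cast<int>(X), static_cast<int>(Y) };
}

// Field of view for a lens of focal length ff in mm: 2 ATAN2(21.2, ff).
double fieldOfView(double ff) { return 2.0 * std::atan2(21.2, ff); }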
6.7 Summary
Let's take a quick review of what we have covered thus far. We know that
to construct a virtual environment, we need to be able to connect different
primitive shapes together. We then need to be able to assign properties to
these shapes; for example, color, lighting, textures etc. Once we have constructed our virtual environment, we then need to be able to produce single
images of it, which we do by a process of rendering.
Whilst we have not covered the details of the rendering process yet, we do
know that before we render an image, we need to be able to determine what
we will actually see of the virtual environment. Some objects or primitives will
fall into this view and others will not. In order to determine what we can see,
we need to be able to perform some 3D geometry. In this chapter, we covered
the basics of this. In particular, we discussed determining the intersection
points of planes and lines and setting up viewing transformations.
So now that we have looked at how the viewing transformation is constructed, it is a good idea to look at the process of utilizing this to produce a
2D picture of our 3D environment. It is to the interesting topic of rendering
that we now turn our attention.
The Rendering Pipeline
In this chapter, we are going to take a look inside the most important element of VR: the 3D graphics. Creating realistic-looking 3D graphics (a process called rendering) poses a particular challenge for VR system builders because, unlike making a 3D movie or TV show, we must endow our system with a capability for real-time interactivity. In the past, this has meant
that VR graphics engines1 had difficulty in achieving that really realistic look.
Not any more! For this, we have to thank the computer-game industry. It
was the multi-billion dollar market demand for games that accelerated the
development of some spectacular hardware rendering engines, available now
at knockdown prices. VR system builders can now expect to offer features
such as stereopsis, shadowing, high-dynamic-range lighting, procedural textures and Phong shading, all rendered in high resolution and in real time and with full interactivity.
In Chapters 13 and 14 we will look at how to write application programs
that use the 3D hardware accelerated rendering engines, but here we intend
to look at the algorithmic principles on which they operate. In using any
high-tech system, we hope you will agree that it is of enormous benefit to
know, in principle, how it works and what is going on inside, so here goes.
The basic rendering algorithm takes a list of polygons and vertices and
produces a picture of the object they represent. The picture (or image) is
recorded as an array of bytes in the output frame buffer. This array is made up
¹ The term engine is commonly used to describe the software routines of an application program that carry out the rendering. It originated in the labs of computer-game developers.
of little regions of the picture called fragments.2 The frame buffer is organized
into rows, and in each row there are a number of fragments. Each fragment in
the frame buffer holds a number which corresponds to the color or intensity
at the equivalent location in the image. Because the number of fragments in
the frame buffer is finite (the fragment value represents a small area of the real
or virtual world and not a single point), there will always be some inaccuracy
when they are used to display the image they represent.
The term graphics pipeline is used to describe the sequence of steps that
the data describing the 3D world goes through in order to create 2D images
of it. The basic steps of the graphics pipeline are:
1. Geometry and vertex operations. The primary stage of the graphics
pipeline must perform a number of operations on the vertices that
describe the shape of objects within a scene or the scene itself. Example operations include transformation of all vertex coordinates to a
real-world frame of reference, computation of the surface normal and
any attributes (color, texture coordinate etc.) associated with each vertex etc.
2. Occlusion, clipping and culling. Occlusion attempts to remove polygons quickly from the scene if they are completely hidden. (Occlusion
is different from clipping or culling because algorithms for occlusion
operate much faster, but sometimes make a mistake, so it is typically
applied only to scenes with enormous polygon counts, greater than 1 × 10⁶ to 2 × 10⁶.) Clipping and culling again remove polygons or
parts of polygons that lie outside a given volume bounded by planes
(called clipping planes), which are formed using a knowledge of the
viewpoint for the scene. This allows the rasterizer to operate efficiently,
and as a result, only those polygons that we will actually see within our
field of view will be rendered.
3. Screen mapping. At this stage in the graphics pipeline, we have the 3D
coordinates of all visible polygons from a given viewpoint for the scene.
Now we need to map these to 2D screen or image coordinates.
4. Scan conversion or rasterization. The rasterizer takes the projections
of the visible polygons onto the viewing plane and determines which
fragments in the frame buffer they cover. It then applies the Z-buffer algorithm to determine which of the polygons is visible in each of those fragments.
² An individual fragment corresponds to a pixel that is eventually displayed on the output screen.
7.1 Geometry and Vertex Operations
At this stage in the graphics pipeline, we need to perform some basic operations at each vertex within the scene. As such, the amount of work that has to
be done is in proportion to the number of vertices in the scene, or the scene
complexity. Essentially, these operations can be summarized as:
Modeling transformation. When creating a 3D environment, we will
almost always require some 3D objects or models to place within that
environment. These models will have been created according to a local
frame of reference. When we insert them into our 3D environment,
we must transform all their vertex coordinates from that local reference
frame to the global frame of reference used for our 3D environment.
This transformation can of course include some scaling, translation and
rotation, all of which were outlined in Section 6.6.
Viewing transformation. Once our global 3D environment has been
created, we have basically a 3D model described by a list of polygons and vertices, to which the viewing transformation of Section 6.6.6 is applied so that every vertex is expressed in the camera's frame of reference.
7.2 Occlusion, Clipping and Culling
In complex environments, it is unlikely that you will be able to see every detail
of the environment from a given viewpoint. For example, an object might be
in complete shadow, it might be occluded by another object or perhaps it
is outside the viewing area. Since rendering is very processor-intensive, it
seems natural to try to avoid even thinking about rendering those objects that
you cannot see. Therefore, this stage of the graphics pipeline aims to reject
objects which are not within the field of view of the virtual camera or that
are occluded by other objects within the environment.
We will use the term culling to apply to the action of discarding complete
polygons; i.e., those polygons with all their vertices outside the field of view.
Clipping will refer to the action of modifying those polygons (or edges) that
are partially in and partially out of the field of view. Culling should always
be performed first, because it is likely to give the biggest performance gain.
Figure 7.1 illustrates the principle of culling, clipping and occluding. In Figure 7.1(c), the importance of occlusion becomes evident in high-complexity
scenes. A building plan is depicted with a large number of objects that fall
inside the field of view but are occluded by the walls between rooms. These objects cannot be seen from the viewpoint and so need not be rendered at all.
7.2.1 Culling
Figure 7.2. Surface normal vectors for a cube and a plan of the vertices showing a consistent ordering of the vertices round each polygonal face. For example, n₁ = (P₂ − P₁) × (P₃ − P₁).
7.2.2 Clipping
After the culling stage, we will probably have rejected a large number of
polygons that could not possibly be seen from the viewpoint. We can also
be sure that if all of a polygon's vertices lie inside the field of view then it
might be visible and we must pass it to the rasterization stage. However, it is
very likely that there will still be some polygons that lie partially inside and
outside the field of view. There may also be cases where we cannot easily tell
whether a part of a polygon is visible or not. In these cases, we need to clip
away those parts of the polygon which lie outside the field of view. This is
non-trivial, because we are essentially changing the shape of the polygon. In
some cases, the end result of this might be that none of the polygon is visible
at all.
Clipping is usually done against a set of planes that bound a volume. Typically this takes the form of a pyramid with the top end of the pyramid at the
viewpoint. This bounded volume is also known as a frustum. Each of the sides of the pyramid is then known as a bounding plane, and together these form
the edges of the field of view. The top and bottom of the frustum form the
front and back clipping planes. These front and back clipping planes are at
right angles to the direction of view (see Figure 7.3). The volume contained
within the pyramid is then known as the viewing frustum. Obviously, polygons and parts of polygons that lie inside that volume are retained for rendering. Some polygons can be marked as lying completely outside the frustum as
the result of a simple test, and these can be culled. However, the simple test is
often inconclusive and we must apply the full rigor of the clipping algorithm.
Figure 7.3. Cubic and truncated pyramidal clipping volumes penetrated by a rectangular polygon. Only that portion of the rectangle which lies inside the clipping
volume would appear in the rendered image.
Figure 7.4. Clipping a triangular polygon, ABC, with a yz plane at PP′, at (xp, 0, 0). Clipping divides ABC into two pieces. If the polygons are to remain triangular, the piece ADEC must be divided in two.

The point p at which an edge from P1 to P2 crosses the clipping plane x = xp is found by linear interpolation:

$$p_y = P_{1y} + \frac{x_p - P_{1x}}{P_{2x} - P_{1x}}\,(P_{2y} - P_{1y}), \qquad
p_z = P_{1z} + \frac{x_p - P_{1x}}{P_{2x} - P_{1x}}\,(P_{2z} - P_{1z}).$$
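A small C++ helper corresponding to this interpolation might be written as follows (a sketch under the assumption that the clipping plane is x = xp and that the edge genuinely crosses it):

struct Point3 { double x, y, z; };

// Intersection of the edge P1-P2 with the clipping plane x = xp.
// The parameter t measures how far along the edge the plane is crossed,
// and y and z are interpolated linearly with the same t.
Point3 clipEdgeAtPlaneX(Point3 P1, Point3 P2, double xp) {
    double t = (xp - P1.x) / (P2.x - P1.x);   // caller guarantees P1.x != P2.x
    Point3 p;
    p.x = xp;
    p.y = P1.y + t * (P2.y - P1.y);
    p.z = P1.z + t * (P2.z - P1.z);
    return p;
}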
7.3 Screen Mapping
The 3D coordinates of the visible polygons must now be mapped to 2D screen coordinates. For example, in the real world, when we use a camera,
we create a 2D picture of the 3D world. Likewise, in the virtual world, we
now need to create a 2D image of what our virtual camera can see based on
its viewpoint within the 3D environment. 3D coordinate geometry allows us
to do this by using a projective transformation, where all the 3D coordinates
are transformed to their respective 2D locations within the image array. The
intricacies of this are detailed in Section 6.6.7. This process only affects the
3D y- and z-coordinates (again assuming the viewing direction is along the
x-axis). The new 2D (X , Y )-coordinates, along with the 3D x-coordinate, are
then passed on to the next stage of the pipeline.
7.4 Rasterization
At the next stage, the visible primitives (points, lines, or polygons) are decomposed into smaller units corresponding to pixels in the destination frame
buffer. Each of these smaller units generated by rasterization is referred to
as a fragment. For instance, a line might cover five pixels on the screen, and
the process of rasterization converts the line (defined by two vertices) into five
fragments. A fragment comprises a frame buffer coordinate, depth information and other associated attributes such as color, texture coordinates and
so on. The values for each of these attributes are determined by interpolating
between the values specified (or computed) at the vertices of the primitive.
Remember that in the first stage of the graphics pipeline, attributes are only
assigned on a per-vertex basis.
So essentially, this stage of the pipeline is used to determine which fragments in the frame buffer are covered by the projected 2D polygons. Once it
is decided that a given fragment is covered by a polygon, that fragment needs
to be assigned its attributes. This is done by linearly interpolating the attributes assigned to each of the corresponding 3D vertex coordinates for that
polygon in the geometry and vertex operations stage of the graphics pipeline.
Now, if we simply worked through the list of polygons sequentially and
filled in the appropriate fragments with the attributes assigned to them, the
resulting image would look very strange. This is because if the second sequential polygon covers the same fragment as the first then all stored information
about the first polygon would be overwritten for that fragment. Therefore,
as the second polygon is considered, we need to know the 3D x-coordinate
of the relevant vertex and determine if it is in front or behind the previous
x-coordinate assigned to that fragment. If it is in front of it, a new attribute
value for the second polygon can be assigned to that fragment and the associated 3D x-coordinate will be updated. Otherwise, the current attributes will
remain unchanged. This procedure determines which polygon is visible from
the viewpoint. Remember, we have defined the viewing plane to be at x = 1.
Therefore, the 3D x-coordinate of each vertex determines how far it is from
the viewing plane. The closer it is to the viewing plane, the more likely that
the object connected to that vertex will not be occluded by other objects in
the scene. Essentially, this procedure is carried out by the Z-buffer algorithm,
which we will now cover in more detail.
7.4.1 The Z-Buffer Algorithm
The Z-buffer algorithm is used to manage the visibility problem when rendering 3D scenes; that is, which elements of the rendered scene will be visible
and which will be hidden. Look at Figure 7.5; it shows a polygon and its
projection onto the viewing plane, where an array of fragments is also illustrated.
Figure 7.5. Projecting back from the viewpoint through pixel (i, j) leads to a point in
the interior of polygon k.
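Since the detailed discussion is abbreviated here, the essential Z-buffer test can be sketched in C++ as follows (our own minimal version; the depth stored per fragment is the 3D x-coordinate because the viewing direction lies along the x-axis):

#include <vector>
#include <limits>
#include <cstdint>
#include <cstddef>

// One depth value and one color per fragment of a width x height frame buffer.
struct FrameBuffer {
    int width, height;
    std::vector<double>   depth;   // distance from the viewing plane (x-coordinate)
    std::vector<uint32_t> color;   // packed RGB of the nearest polygon seen so far
    FrameBuffer(int w, int h)
        : width(w), height(h),
          depth(static_cast<std::size_t>(w) * h, std::numeric_limits<double>::max()),
          color(static_cast<std::size_t>(w) * h, 0) {}
};

// Z-buffer test: keep the new color only if this polygon is nearer to the
// viewing plane at fragment (i, j) than anything drawn there before.
void writeFragment(FrameBuffer& fb, int i, int j, double x, uint32_t rgb) {
    std::size_t index = static_cast<std::size_t>(j) * fb.width + i;
    if (x < fb.depth[index]) {
        fb.depth[index] = x;
        fb.color[index] = rgb;
    }
}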
7.5 Fragment Processing
In the early generations of hardware graphics processing chips (GPUs), fragment processing primarily consisted of interpolating lighting and color from
the vertex values of the polygon which has been determined to be visible
in that fragment. However, today in software renderers and even in most
hardware rendering processors, the lighting and color values are calculated on
a per-fragment basis, rather than interpolating the value from the polygon's vertices.
In fact, determining the color value for any fragment in the output frame
buffer is the most significant task any rendering algorithm has to perform. It
governs the shading, texturing and quality of the final rendering, all of which
are now outlined in more detail.
7.5.1 Shading (Lighting)
The way in which light interacts with surfaces of a 3D model is the most significant effect that we can model so as to provide visual realism. Whilst the
Z-buffer algorithm may determine what you can see, it is mainly the interaction with light that determines what you do see. To simulate lighting effects,
it stands to reason that the location and color of any lights within the scene
must be known. In addition to this, we also need to classify the light as being
one of three standard types, as illustrated in Figure 7.7. That is:
1. A point light source that illuminates in all directions. For a lot of indoor scenes, a point light source gives the best approximation to the lighting conditions.

2. A directional or parallel light source. In this case, the light comes from a specific direction which is the same for all points in the scene. (The illumination from the sun is an example of a directional light source.)

3. A spotlight, which illuminates only those points lying within a cone of directions centered on the spotlight's axis.
Figure 7.8. Lights, camera and objects. Reflected light finds its way to the observer by
being reflected from the objects' surfaces.
Figure 7.9. Diffuse illumination. (a) The brightest illumination occurs when the
incident light direction is at right angles to the surface. (b) The illumination tends
to zero as the direction of the incident light becomes parallel to the surface.
For a point light at Pl illuminating a surface point p where the unit normal is n̂, the diffuse illumination is proportional to the cosine term

$$I_d = \hat{\mathbf{n}} \cdot \frac{\mathbf{P}_l - \mathbf{p}}{|\mathbf{P}_l - \mathbf{p}|}.$$
For a spotlight, the illumination is restricted to a cone. If φ is the angle between the axis of the spotlight and the direction from the light at Pl to the point p, and θ1 and θ2 are the cone angles inside which the light is at full strength and beyond which it is cut off, then

$$I_d = \begin{cases}
\hat{\mathbf{n}} \cdot \dfrac{\mathbf{P}_l - \mathbf{p}}{|\mathbf{P}_l - \mathbf{p}|} & \text{if } \phi < \theta_1, \\[2ex]
\left( \hat{\mathbf{n}} \cdot \dfrac{\mathbf{P}_l - \mathbf{p}}{|\mathbf{P}_l - \mathbf{p}|} \right) \dfrac{\theta_2 - \phi}{\theta_2 - \theta_1} & \text{if } \theta_1 \le \phi \le \theta_2, \\[2ex]
0 & \text{if } \theta_2 < \phi.
\end{cases}$$
First calculate the vector b which bisects the directions from p to the viewpoint Pv and from p to the light Pl. The vector b makes an angle β with the surface normal; β is easily determined because cos β = b̂ · n̂ (the surface normal n̂ is known), whilst b is given by

$$\mathbf{b} = \frac{\mathbf{P}_l - \mathbf{p}}{|\mathbf{P}_l - \mathbf{p}|} + \frac{\mathbf{P}_v - \mathbf{p}}{|\mathbf{P}_v - \mathbf{p}|}, \qquad
\hat{\mathbf{b}} = \frac{\mathbf{b}}{|\mathbf{b}|}.$$
In Figure 7.10 we note that the angle between the reflected ray and the viewing direction is twice the angle β between b̂ and n̂. Using the double-angle formula, the specular term with exponent m (the shininess) is therefore

$$I_s = (2\cos^2\beta - 1)^m, \qquad \text{or} \qquad I_s = \left( 2(\hat{\mathbf{b}} \cdot \hat{\mathbf{n}})^2 - 1 \right)^m.$$
RGB color model, where a color is stored using three components, one for
each of the primary colors, red, green and blue. Any color the eye can perceive
can be expressed in terms of an RGB triple. Thus c is recorded as cR , cG , cB ,
which are usually stored as unsigned 8-bit integers, giving a range of 256
discrete values for each color component. For preliminary calculation, it is
usually assumed that cR , cG and cB are recorded as floating-point numbers in
the range [0, 1]. We will also assume that any light or surface color is also
given as an RGB triple in the same range. To determine the color that is
recorded in a fragment, the effect of the lights within the scene need to be
combined with the surface properties of the polygon visible in that fragment.
The simplest way to do this is to break up the mathematical model for light
and surface interaction into a number of terms, where each term represents a
specific physical phenomenon.
In the following expressions, sR, sG and sB represent the color of the surface,
and lR , lG , lB the color of the light. Ia , Ic , Id and Is are the four contributions
to the lighting model that we have already discussed. Using this terminology,
the color c calculated for the fragment in question may be expressed as
cR = Ia sR + Ic (Is + Id sR )lR ,
cG = Ia sG + Ic (Is + Id sG )lG ,
cB = Ia sB + Ic (Is + Id sB )lB .
To form general expressions for the effect of lighting a surface in a scene, at some point p, with n lights, these equations become

$$c_R = I_a s_R + I_c \sum_{i=0}^{n-1} \left( I_s(i) + I_d(i)\, s_R \right) l_R(i), \qquad (7.1)$$

$$c_G = I_a s_G + I_c \sum_{i=0}^{n-1} \left( I_s(i) + I_d(i)\, s_G \right) l_G(i), \qquad (7.2)$$

$$c_B = I_a s_B + I_c \sum_{i=0}^{n-1} \left( I_s(i) + I_d(i)\, s_B \right) l_B(i). \qquad (7.3)$$
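A per-fragment shading routine built directly from Equations (7.1)-(7.3) might be sketched in C++ as follows (our own illustration; the LightTerms structure simply carries the already-evaluated Is and Id values and the light color for each of the n lights):

#include <vector>

struct Color { double r, g, b; };            // components in the range [0, 1]

// Per-light contributions already evaluated at the surface point:
// Is = specular term, Id = diffuse term, and the light's color l.
struct LightTerms { double Is, Id; Color l; };

// Equations (7.1)-(7.3): combine ambient, diffuse and specular contributions
// with the surface color s. Ia and Ic are the ambient and attenuation terms.
Color shadeFragment(const Color& s, double Ia, double Ic,
                    const std::vector<LightTerms>& lights) {
    Color c{ Ia * s.r, Ia * s.g, Ia * s.b };
    for (const LightTerms& L : lights) {
        c.r += Ic * (L.Is + L.Id * s.r) * L.l.r;
        c.g += Ic * (L.Is + L.Id * s.g) * L.l.g;
        c.b += Ic * (L.Is + L.Id * s.b) * L.l.b;
    }
    return c;   // clamp to [0, 1] before quantizing to 8-bit values
}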
7.5.2 Smooth Shading
Look at Figure 7.11. It shows two pictures of the same faceted model of a
sphere. The one on the right looks smooth (apart from the silhouetted edges,
which we will discuss later). The one on the left looks like just what it is, a
collection of triangular polygons. Although the outline of neither is perfectly
circular, it is the appearance of the interior that first grabs the attention. This
example highlights the main drawback of the representation of an object with
a model made up from polygonal facets. To model a sphere so that it looks
smooth by increasing the number (or equivalently decreasing the size) of the
facets is quite impractical. Thousands of polygons would be required just for
a simple sphere. However, both the spheres shown in Figure 7.11 contain the
same number of polygons, and yet one manages to look smooth. How can
this happen?
The answer is the use of a trick that fools the eye by smoothly varying
the shading within the polygonal facets. The point has already been made
that if you look at the outline of both spheres, you will see that they are
made from straight segments. In the case of the smooth-looking sphere, it
looks smooth because the discontinuities in shading between adjacent facets
have been eliminated. To the eye, a discontinuity in shading is much more
noticeable than a small angular change between two edges. This smooth
shading can be achieved using either the Phong or the Gouraud approach.
We shall now discuss these shading methods in more detail.
Gouraud Shading
Gouraud shading is used to achieve smooth lighting on polygon surfaces without the computational burden of calculating lighting for each fragment. To
do this, a two-step algorithm is used:
1. Calculate the intensity of the light at each vertex of the model; that is,
the computed colors at each vertex.
In Section 7.5.1, we derived equations to calculate the color intensity at
a point on the surface of a polygon. We use these equations to calculate
the color intensity at vertex i. The normal N used in these equations is
taken to be the average of the normals of each of the polygon surfaces
which are attached to the vertex. For example, to calculate the light
intensity at vertex i, as shown in Figure 7.12, we need to compute the
normal at vertex i. To do this, we need to average the surface normals
calculated for each of the attached polygons (j, j + 1, j + 2, j + 3 and
j + 4).
Figure 7.13. Calculate the light intensity at a raster pixel position within the projection of a polygon.
2. Interpolate these vertex intensities over the projection of each polygon. If the polygon's vertices project to (X0, Y0), (X1, Y1) and (X2, Y2), then at the pixel (i, j) the interpolation coefficients are

$$\alpha = \frac{(i - X_0)(Y_2 - Y_0) - (j - Y_0)(X_2 - X_0)}{\Delta}, \qquad
\beta = \frac{(j - Y_0)(X_1 - X_0) - (i - X_0)(Y_1 - Y_0)}{\Delta},$$

where Δ = (X1 − X0)(Y2 − Y0) − (X2 − X0)(Y1 − Y0), and the interpolated intensity is I = I0 + α(I1 − I0) + β(I2 − I0).
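As a sketch of how this interpolation might be coded (our own C++ illustration, using the α, β and Δ defined above), the color at pixel (i, j) can be computed as:

struct Color { double r, g, b; };

// Projected vertex position (X, Y) and the color computed at that vertex.
struct ProjectedVertex { double X, Y; Color c; };

// Gouraud shading, stage 2: interpolate the vertex colors at pixel (i, j)
// inside the projected triangle v0 v1 v2 using the coefficients alpha, beta.
Color gouraudColor(const ProjectedVertex& v0, const ProjectedVertex& v1,
                   const ProjectedVertex& v2, double i, double j) {
    double delta = (v1.X - v0.X) * (v2.Y - v0.Y) - (v2.X - v0.X) * (v1.Y - v0.Y);
    double alpha = ((i - v0.X) * (v2.Y - v0.Y) - (j - v0.Y) * (v2.X - v0.X)) / delta;
    double beta  = ((j - v0.Y) * (v1.X - v0.X) - (i - v0.X) * (v1.Y - v0.Y)) / delta;
    return Color{
        v0.c.r + alpha * (v1.c.r - v0.c.r) + beta * (v2.c.r - v0.c.r),
        v0.c.g + alpha * (v1.c.g - v0.c.g) + beta * (v2.c.g - v0.c.g),
        v0.c.b + alpha * (v1.c.b - v0.c.b) + beta * (v2.c.b - v0.c.b)
    };
}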
Phong Shading
The Gouraud shading procedure works well for the diffuse reflection component of a lighting model, but it does not give a good result when specular
reflection is taken into account. Specular highlights occupy quite a small proportion of a visible surface, and the direction of the normal at precisely the
visible point must be known so that the specular reflection can be determined.
Thus, Phong's approach was to interpolate the surface normal over a polygon
rather than interpolate the light intensity. In the other aspects, the procedure
is similar to the Gouraud shading algorithm.
Phong shading is done in the following two stages:
1. Calculate a normal vector for each vertex of the model.
2. Use two-dimensional interpolation to determine the normal at any
point within a polygon from the normals at the vertices of the
polygon.
In Stage 1, an analogous procedure to the first step of the Gouraud algorithm is used. Figure 7.14 shows a cross section through a coarse (six-sided)
approximation of a cylinder. In Figure 7.14(a), the facet normals are illustrated, Figure 7.14(b) shows the vertex normals obtained by averaging normals from facets connected to a particular vertex and Figure 7.14(c) shows the
Figure 7.14. In Phong smoothing, the surface normals are averaged at the vertices and then the vertex normals are interpolated over the flat facets to give an illusion of a continuously varying curvature.
In Stage 2, the normal at pixel (i, j) inside the projection of a polygon is interpolated from the vertex normals n0, n1 and n2 using the same coefficients α and β as before:

$$\mathbf{n}_{i,j} = \mathbf{n}_0 + \alpha(\mathbf{n}_1 - \mathbf{n}_0) + \beta(\mathbf{n}_2 - \mathbf{n}_0), \qquad
\hat{\mathbf{n}}_{i,j} = \frac{\mathbf{n}_{i,j}}{|\mathbf{n}_{i,j}|}.$$
Moving one pixel to the right along scanline j changes the interpolated normal by a constant vector Δn given by

$$\begin{bmatrix} \Delta n_x \\ \Delta n_y \\ \Delta n_z \end{bmatrix} =
\frac{1}{\Delta}
\begin{bmatrix}
(Y_1 - Y_2)n_{0x} + (Y_2 - Y_0)n_{1x} + (Y_0 - Y_1)n_{2x} \\
(Y_1 - Y_2)n_{0y} + (Y_2 - Y_0)n_{1y} + (Y_0 - Y_1)n_{2y} \\
(Y_1 - Y_2)n_{0z} + (Y_2 - Y_0)n_{1z} + (Y_0 - Y_1)n_{2z}
\end{bmatrix}.$$

Normals in the remaining fragments on scanline j covered by polygon k are thus determined sequentially from

$$\mathbf{n}_{i+1,j} = \mathbf{n}_{i,j} + \Delta\mathbf{n}, \qquad
\hat{\mathbf{n}}_{i+1,j} = \frac{\mathbf{n}_{i+1,j}}{|\mathbf{n}_{i+1,j}|}.$$
Once the lighting and shading calculations are completed, each fragment
can be written to the output frame buffer, which is then used to output our
rendered image onto the screen.
7.6 Texturing
7.6.1 Procedural Textures
7.6.2 Image Mapping
An image map performs a similar task to that of a procedural texture. However, image maps facilitate much greater control over the appearance of the
surface they are applied to. In essence, any picture may be used as the
source for the image map (2D artwork or scanned photographs are common
sources). This is particularly useful in enhancing the appearance of outdoor
environments and for product design applications where text, manufacturers' logos or labels can be added to a 3D model of their product.
The other major difference between a procedural texture and an image
map is that the surface color is not calculated on the basis of a mathematical
model. Rather, the surface color is obtained in what one might describe as
a data lookup procedure. The color to be used for the surface fragment is
obtained from one or a combination of the pixels that make up the image
map. It is here that the difficulty lies. We need to determine which pixel
within the image map corresponds to the surface fragment and then we need
to look it up. More mathematics!
If the surface fragment, centered on point p, lies on the triangular polygon
with vertices P0 , P1 and P2 then s, the color to be substituted into the lighting
model, is determined by executing the following steps. Figure 7.15 illustrates
the process.
Figure 7.15. Mapping an image into a rectangle with texture coordinates of (0, 0) at one corner and (1, 1) at the diagonally opposite corner.

The texture coordinates (X, Y) at the point p are obtained from the coordinates (X0, Y0), (X1, Y1) and (X2, Y2) assigned to the vertices P0, P1 and P2:

$$X = X_0 + \alpha(X_1 - X_0) + \beta(X_2 - X_0), \qquad (7.4)$$

$$Y = Y_0 + \alpha(Y_1 - Y_0) + \beta(Y_2 - Y_0). \qquad (7.5)$$
There are two important issues that arise in Step 3. First, to obtain a valid address in A, both i and j must simultaneously satisfy 0 ≤ i < n and 0 ≤ j < m or, equivalently, scale to a unit square: 0 ≤ X < 1 and 0 ≤ Y < 1. Thus
the question arises as to how to interpret texture coordinates that fall outside
the range [0, 1]. There are three possibilities, all of which prove useful. They
are:
1. Do not proceed with the mapping process for any point with texture
coordinates that fall outside the range [0, 1]; just apply a constant color.
2. Use a modulus function to tile the image over the surface so that it is
repeated as many times as necessary to cover the whole surface.
3. Tile the image over the surface, but choose a group of four copies of
the image to generate blocks that themselves repeat without any seams,
i.e., a mosaic pattern. To generate a mosaic pattern, the texture-coordinate values given by Equations (7.4) and (7.5) are modified using a floor() function and integer rounding.
The second issue is the rather discontinuous way of rounding the floating-point texture coordinates (X, Y) to the integer address (i, j) (i = truncate(X) and j = truncate(Y)) used to pick the color from the image. A better approach is to use a bilinear (or higher-order) interpolation to obtain s, which takes into account not only the color in image pixel (i, j) but also the color in the image pixels adjacent to it. Thus, if the texture coordinates at p are (X, Y), we obtain s by using a bilinear interpolation between color values from A_{i-1,j}, A_{i+1,j}, A_{i,j-1} and A_{i,j+1} in addition to the color from A_{i,j}. Note: Bilinear interpolation gives good results when the image map is being magnified, but in cases where the image is being reduced in size, as the object being mapped moves away from the camera, for example, the use of a mipmap (see Moller and Haines [8]) gives a better approximation.
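A minimal C++ sketch of such a lookup, with simple tiling of out-of-range coordinates and bilinear filtering between the four neighboring pixels, might look like this (the Image type and function names are ours, not the book's):

#include <vector>
#include <cmath>

struct Color { double r, g, b; };

// A simple n x m image map, pixels stored row by row.
struct Image {
    int n, m;                       // width and height in pixels
    std::vector<Color> pixels;
    Color at(int i, int j) const { return pixels[j * n + i]; }
};

// Tile the texture coordinate into [0, 1) (option 2 in the list above).
double wrap(double t) { return t - std::floor(t); }

// Bilinear lookup of the surface color s at texture coordinates (X, Y).
Color sampleBilinear(const Image& A, double X, double Y) {
    double x = wrap(X) * (A.n - 1), y = wrap(Y) * (A.m - 1);
    int i = static_cast<int>(x), j = static_cast<int>(y);
    int i1 = (i + 1 < A.n) ? i + 1 : i, j1 = (j + 1 < A.m) ? j + 1 : j;
    double fx = x - i, fy = y - j;   // fractional position between pixel centers
    Color c00 = A.at(i, j),  c10 = A.at(i1, j);
    Color c01 = A.at(i, j1), c11 = A.at(i1, j1);
    auto lerp = [](double a, double b, double t) { return a + t * (b - a); };
    return Color{
        lerp(lerp(c00.r, c10.r, fx), lerp(c01.r, c11.r, fx), fy),
        lerp(lerp(c00.g, c10.g, fx), lerp(c01.g, c11.g, fx), fy),
        lerp(lerp(c00.b, c10.b, fx), lerp(c01.b, c11.b, fx), fy)
    };
}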
7.6.3 Transparency Mapping
Transparency mapping follows the same steps as basic image mapping to deliver a color value, s, for use at a point p on polygon k. However, instead
of using s directly as a surface color, it is used to control a mixture between
the color settings for polygon k and a color derived from the next surface
recorded in the transparency buffer. (Remember that the Z-buffer algorithm
can be modified to hold several layers so that if any of them were transparent,
the underlying surface would show through.) Note that no lighting effects are applied to the proportion of the mix that comes from the underlying surface; that is, when you look through a window from inside a room, it is not the light in the room that governs the brightness of the world outside.
7.6.4 Bump Mapping
This technique, introduced by Blinn [1], uses the color from an image map to
modulate the direction of the surface normal vector. To implement a bump
map, the change (gradient or derivative) of brightness across the image, from
one pixel to another, determines the displacement vector added to the surface
normal of polygon k at point p. Most of the analysis we need in order to
calculate the perturbing vector Δn has been covered already. Bump mapping
usually uses the same texture coordinates as are used for basic image mapping.
In fact, it is essential to do so in cases where an image and bump map are part
of the same material. For example, a brick surface can look very good with a
bump map used to simulate the effect of differential weathering on brick and
mortar.
To determine Δn, we need to find incremental vectors parallel to two of the sides of polygon k; call them Δn1 and Δn2. Do this by first finding texture coordinates at points close to p (at p the texture coordinates are (X, Y) given by Equations (7.4) and (7.5)). The easiest way to do this is to make small increments, say Δα and Δβ, in α and β and use Equations (7.4) and (7.5) to obtain texture coordinates (Xr, Yr) and (Xb, Yb).
Before using the texture coordinates to obtain bump values from the map, it is necessary to ensure that the distance in texture space between (X, Y) and (Xr, Yr) and between (X, Y) and (Xb, Yb) is small. To achieve this, write

$$\Delta t = \sqrt{(X_r - X)^2 + (Y_r - Y)^2}, \qquad
X_r = X + \frac{(X_r - X)}{\Delta t}, \qquad
Y_r = Y + \frac{(Y_r - Y)}{\Delta t},$$

and similarly for (Xb, Yb).
7.6.5 Reflection Mapping
True reflections can only be simulated with ray tracing or other time-consuming techniques, but for a lot of uses, a reflection map can visually satisfy all the requirements. Surfaces made from gold, silver, chrome and even the shiny paintwork on a new automobile can all be realistically simulated with a skillfully created reflection map, a technique introduced by Blinn and
Newell [2]. An outline of a reflection map algorithm is:
1. With the knowledge that polygon k, at point p, is visible in fragment
(i, j) determine the reflection direction d of a line from the viewpoint
to p. (Use the procedure given in Section 6.5.)
2. Since the basic rendering algorithm transformed the scene so that the
observer is based at the origin (0, 0, 0) and is looking in direction
(1, 0, 0), the reflected direction d must be transformed with the inverse
of the rotational part of the viewing transformation. The necessary
inverse transformation can be obtained at the same time. Note: the
inverse transformation must not include any translational component,
because d is a direction vector.
3. Get the Euler angles that define the direction of d. These are the heading φ, which lies in the range [−π, π], and the pitch θ, which lies in the range [−π/2, π/2]. Scale the range of φ to cover the width Xmax of the reflection image and the range of θ to cover its height Ymax:

$$X = \frac{\phi + \pi}{2\pi}\, X_{max}, \qquad Y = \frac{\theta + \frac{\pi}{2}}{\pi}\, Y_{max}.$$
Figure 7.16. A reflection map can be thought of as an image painted on the inside of a sphere that completely surrounds the object being rendered.
7.6.6
The image mapping and procedural textures that we have just been discussing, and indeed the Phong shading and the lighting models, have all been
7.7 Summary
In this chapter, we have described the main stages in the process which turns
a numerical description of the virtual world into high-quality rendered images. Whole books have been written on this subject; Moller and Haines [8]
provide comprehensive detail of all the important aspects of rendering in real time.
If you need to build special features or effects into your VR visualization code, it is likely that you will find an algorithm for them in the Graphics Gems
series [6], which is a rich source of reference material for algorithms associated with rendering and visualization in general. The new custom processors
specifically designed for graphics work (the GPUs) attract their own special
algorithms, and the two books in the GPU Gems series [5] provide a fascinating insight into how this hardware can be used to render effects and elements
of reality that we previously thought impossible.
The focus of this chapter has been on real-time rendering so that we can
bring our virtual worlds and objects to life, make them move and make them
interact with us. In the next few chapters, we will be building on this.
Bibliography
[1] J. Blinn. Jim Blinn's Corner: A Trip Down the Graphics Pipeline. San Francisco, CA: Morgan Kaufmann, 1996.
Computer Vision in VR
Computer vision is, in itself, a useful and interesting topic. It plays a significant role in many applications of artificial intelligence, automated pattern recognition. . . a full list of its applications would be too long to consider here.
However, a number of the key algorithms which emerged during the development of the science of computer vision have obvious and extremely helpful
roles in the implementation and delivery of practical VR applications. The
full detail and mathematical basis for all the algorithms we will be looking
at in this chapter have been rigorously and exquisitely explored by such notable authors as Faugeras [3] and Hartley and Zisserman [7], so if you want
to know more there is no better place to look. But remember, this is not a
topic for the faint of heart!
In the context of VR, the geometric aspects of computer vision are the
most significant and therefore the ones which we intend to introduce here.
For example, in an augmented reality videoconferencing environment [2],
the user can see images of the participants on virtual displays. The video
images on these virtual displays have to be distorted so as to take on their
correct perspective as the viewer moves around the virtual world.
To achieve results like these, one needs to study how the appearance of
objects changes as they are viewed from different locations. If you want to
make use of interactive video in a virtual world, you also need to consider the
effects of distortions in the video camera equipment. And of course all this
has to be done in real time. Get it right and you will be able to realistically
and seamlessly insert the virtual world into the real world or the real world
into the virtual. Get it wrong and you just might end up making your guests
feel sick.
In Chapter 6, we examined some of the properties of Euclidean geometry and introduced the concept of homogeneous coordinates in the context
of formulating the transformations of translation, scaling and rotation. In
that chapter, it appeared that the fourth row and fourth column in the
transformation matrices had minimal significance. It was also pointed out
that when visualizing a scene, the three-dimensional geometry had to
be projected onto a viewing plane. In a simplistic rendering application,
this is accomplished with two basic equations (Chapter 6, Equations (6.7)
and (6.8)) that implement a nonlinear transformation from a 3D space to a
2D space.
In human vision, the same nonlinear transformation occurs, since it is
only a 2D image of the 3D world around us that is projected onto the retina
at the back of the eye. And of course, the same projective transformation is
at work in all types of camera, both photographic and electronic. It should
therefore come as no surprise that in the science of computer vision, where
one is trying to use a computer to automatically recognize shapes in twodimensional images (as seen by the camera feeding its signal to the computer), the result of the projective transformation and the distortions that it
introduces are an unwelcome complication. Under a projective transformation, squares do not have to remain square, circles can become ellipses and
angles and distances are not generally preserved. Information is also lost during projection (such as being able to tell how far something is away from the
viewpoint), and therefore one of the main goals in computer vision is to try to
identify what distortion has occurred and recover as much of the lost information as possible, often by invoking two or more cameras. (We normally
use two eyes to help us do this.)
Of course, this is not a book on computer vision, but some of its
ideas such as scene reconstruction using multiple viewpoints can help us build
our virtual worlds and interact with them in a realistic manner. So, in this
chapter, we will examine some of the basics of computer vision that let us
do just this.
In order to make any real progress in understanding computer vision,
how to correct for distortion and undo the worst aspects of the projective
transformation, it is first a good idea to introduce the mathematical notation
of the subject. Then we can use it to express and develop the algorithms that
we wish to use for VR.
8.1 The Mathematical Language of Geometric Computer Vision
Such a projective transformation can be written as X′ = HX, where X and X′ are points in the projective space and H is the (n+1) × (n+1) transformation matrix that relates them.
It is interesting to note that under a transformation, points at infinity in Pⁿ are mapped to other points arbitrarily; that is, they are not preserved as
points at infinity. Why are we concerned about these points at infinity? It is
because in computer vision, it is convenient to extend the real 3D world into
3D projective space and to consider the images of it, which form at the back
of the camera, as lying on a 2D projective plane. In this way, we can deal
seamlessly with lines and other artifacts in our image that are parallel, and we
do not have to single them out for special treatment.
8.2 Cameras

8.2.1 Camera Projections
In Section 6.6.7, we gave simplistic expressions for the act of projecting the
3D world onto the viewing plane using a camera. The equations introduced
there are actually quite subtle, and a deeper study can reveal some interesting
facts now that we have more information about the homogeneous coordinate
system and the projective spaces P2 and P3 .
For example, when using projective spaces and homogeneous coordinates,
it is possible to transform points from P3 to P2 using a transformation matrix
P:

$$\begin{bmatrix} X \\ Y \\ W \end{bmatrix} = P_{(3\times 4)} \begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix},$$
Figure 8.1. Projecting a 2D ground plane onto a 2D viewing plane, where both planes
lie in 3D space.
Figure 8.2. Different planes of projection may produce different images, but they all
represent the same visible part of a scene. Only if the camera center changes is the
view fundamentally altered.
any plane passing through the same point on the principal axis could act as
the projection plane. The projection onto any of these planes is still related
by a projective transformation. Only if the camera center moves does the
view change fundamentally. The resulting planar images will look somewhat
different, but in the absence of any other knowledge about the scene, it is
hard to argue that one image is more correct than another. For example, if
a planar surface in the scene is parallel to the imaging plane then circles on
the plane will map to circles in the image, whereas in the case of a different
viewing plane, they would map to ellipses. The effect of imaging onto several
differently orientated projection planes is illustrated in Figure 8.2. Thus one
of the key parameters that describes the properties of a camera is the camera
center.
8.2.2 Camera Models
A fairly simplistic view of the imaging process was given in Section 6.6.7,
and without stating it explicitly, the camera model used there represents the
perfect pinhole camera (see Figure 8.3). In addition, in the previous section
we saw that we can transform points from P3 to P2 . The transformation used
actually represents the camera model. In the case of a perfect pinhole camera,
the camera model can be represented by the transformation matrix
$$\begin{bmatrix} f(1) & 0 & p_x & 0 \\ 0 & f(2) & p_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}, \qquad (8.1)$$
where px and py are the pixel coordinates of the principal point and f (1) and
f (2) are the focal distance expressed in units of horizontal and vertical pixels.
The ratio f (1)/f (2) is the aspect ratio and is different from 1 if the pixels in
the imaging sensor are not square.
If we take a view from the camera center C along the principal axis, we
see that the principal point P lies at the dead center of the image plane (Figure 8.4). In different types of camera, the principal point may be off-center,
and the aspect ratio may not equal 1.
Therefore, if trying to extract accurate 3D information from the 2D image scene, it is necessary to derive an accurate camera matrix through the
process of camera calibration.
Figure 8.3. The ideal pinhole camera showing the camera (or optical) center C, the
principal point P, the principal axis direction and the image plane. The focal length
is f .
To pursue this idea a little bit further, if we write the action of the projection matrix (Equation (8.1)) for square pixels as
$$\mathbf{X} = K[I|0]\,\mathbf{x},$$

where

$$K = \begin{bmatrix} f & 0 & p_x \\ 0 & f & p_y \\ 0 & 0 & 1 \end{bmatrix}
\qquad \text{and} \qquad
[I|0] = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix},$$
then the square matrix K contains all of the information expressing the parameters of the camera and is called the camera calibration matrix. That is,
you will need to determine the pixel coordinates of the principal point in addition to the focal distance. In determining these factors, it is important to
remember that real optical systems do not act like the perfect pinhole camera.
That is, they do not form a point image from a point object; rather they form
an intensity distribution of the image. The resulting distortions or aberrations
can affect the quality of the image and must be corrected for.
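A sketch of the projection X = K[I|0]x for the square-pixel case might be coded as follows (our own illustration, assuming the point is already expressed in the camera's frame with the principal axis along z):

// Pixel coordinates of the projection of a 3D point (x, y, z), expressed in
// the camera's frame of reference, through an ideal pinhole camera with
// square pixels: X = K [I | 0] x.
struct Pixel { double X, Y; };

Pixel projectPinhole(double f, double px, double py,
                     double x, double y, double z) {
    // Homogeneous image coordinates (U, V, W) = K [I | 0] (x, y, z, 1).
    double U = f * x + px * z;
    double V = f * y + py * z;
    double W = z;
    return Pixel{ U / W, V / W };   // divide through to get pixel coordinates
}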
The nonlinear distortions due to the lens systems and associated imperfections in the image sensing technology have implications for some interesting applications in VR, and so we will now examine more advanced ideas in
computer vision which involve removing camera distortion and perspective
distortion.
8.3 A Brief Look at Some Advanced Ideas in Computer Vision
8.3.1 Correcting Lens Distortion
Figure 8.5. Effects of pincushion and barrel distortion on the original scene.
are harder and sometimes nearly impossible to correct in this manner. A second method is to correct the distortion using digital image processing techniques. This method is well researched with an abundance of documented
distortion correction techniques.
Several researchers have presented various mathematical models of image
distortion to find the model parameters required to complete the distortion
correction process. Common to most methods is the use of a test chart consisting of equidistant dots. Usually, several adjustments are required to the
position of the test chart or optical system so that the central point and orientation of the vertical and horizontal lines on the chart are as intended on the
captured image. This is a time-consuming process that has been eliminated
by some researchers [14] by creating distortion algorithms that are sensitive
to changes in both the horizontal and vertical axes.
Vijayan et al. [14] developed a mathematical model, based on polynomial mapping, which is used to map images from the distorted image space
to a corrected image space. The model parameters include the polynomial
coefficients, distortion center and corrected center. They also developed an
accurate program to extract the dot centers from the test chart. However,
Vijayan et al. assumed that the corrected grid lines in the horizontal or vertical direction need not share a common slope, and therefore produced line fits that are not guaranteed to be parallel. Zhang et al. [15] have improved upon
this technique so that all lines in any one direction (vertical or horizontal)
have the same slope, making them parallel.
We will build on the work in [14] and [15] to illustrate how anti-distortion algorithms can be easily utilized when developing virtual environments, to ensure that the accuracy of the 3D scene is carried over into the 2D
image. In addition, we will only consider barrel distortion, as this is the type
most commonly found in cameras.
In general, methods that correct barrel distortion must calculate a distortion center and correct the radial component of the distortion. The distortion
correction problem involves two image spaces. These are the distorted space
and the corrected space of the image. The goal is to derive a method for
transforming the distorted image into a corrected image.
The approach taken to correct for distortion involves a test grid similar to
that shown in Figure 8.6(a). Figure 8.6(b) illustrates the original test pattern
after imaging. It is evident that distortion is present and needs to be corrected.
All the previously straight rows and columns have become curved towards the
center of the image. This is typical of barrel distortion, where points on the
extremity of the image are moved towards the center of distortion.
Distortion parameters are calculated to fit a line of dots on a captured
image. This method offers two advantages, in that the correction coefficients
are calculated in an automatic process and the calibration pattern does not
need to be exactly horizontal or vertical.
The center of distortion is a fixed point for a particular camera and, once
calculated, can be used for all images from that camera for a given focal length.
An accurate estimate of the center of distortion is essential for determining the
corresponding distortion correction parameters.
The lines of the standard test pattern that remain straight after imaging
must lie between adjacent rows and columns of opposite curvatures. The next
step is finding where exactly this center of distortion lies within this bounded
area. This is shown diagrammatically in Figure 8.7. The intersection of these
straight lines should then give the center of distortion.
To estimate the center of distortion, four best-fit polynomials of the form f(x) = Σₙ aₙxⁿ are defined, passing through the adjacent rows and columns of dots on either side of the center.
For each pixel in the corrected image, we find the corresponding location (X′, Y′) in the distorted image space and transfer its pixel value or color to the corrected space. In that way, we build up a new image from the distorted image. The easiest way to do this is to find the absolute distance (r) from (X, Y) to (Xc, Yc) and the angle (θ) that a line joining these two points makes with the x-axis. Note that (Xc, Yc) is the center pixel coordinate for the corrected image. It can be anywhere in the image display frame; however, it is sometimes simpler to assume that it is in the same position as the center of distortion (X′c, Y′c). Now we need to
convert this absolute distance to the corresponding distance in the distorted
image. To do that, we need to characterize the radial component of the distortion. Essentially, a set of N calibration points (rᵢ, i = 0, 1, 2, . . . , N − 1) needs
to be defined from the center of distortion outwards to characterize the radial
distortion. The first calibration point r0 is taken at the center of distortion.
Using these discrete points, a polynomial1 can be defined to approximate the
distortion.
Now the location of the pixel at (r′, θ) in the distorted image space can be calculated. The polynomial defined for the calibration points is used as an interpolation function to find the pixel location. That is,

$$r' = \sum_{j=0}^{M} a_j \left( \frac{r}{\lambda} \right)^j, \qquad \theta' = \theta,$$

where M is the degree of the polynomial, the aj are the expansion coefficients, λ is the scale factor of the pixel height versus width and r is the pixel distance in the undistorted image. It is then possible to calculate the distorted pixel coordinates using both the distorted pixel distance and the polar angle:

$$X' = X'_c + r' \cos\theta, \qquad Y' = Y'_c + r' \sin\theta.$$
The brightness or color value of the pixel closest to the location (X′, Y′) can be found directly from the distorted image. This value is then assigned to the pixel in location (X, Y) of the undistorted image. This process can be repeated for each pixel in the undistorted or corrected image space, thereby reproducing the calibrated image without the effects of distortion.
¹ The degree of polynomial selected to best represent the distortion should be based upon the error fit and the rate of change of error from the previous smaller degree of polynomial fit. A compromise should always be made between computation time and accuracy, both of which tend to increase with the degree of polynomial.
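Putting the pieces together, a backward-mapping correction loop might be sketched in C++ as follows (our own illustration; the polynomial coefficients a[j] are assumed to have been fitted beforehand, and the pixel aspect scale factor λ is omitted for clarity):

#include <vector>
#include <cmath>

// A grey-scale image stored row by row; width w, height h.
struct Image {
    int w, h;
    std::vector<unsigned char> pix;
    unsigned char at(int x, int y) const { return pix[y * w + x]; }
    void set(int x, int y, unsigned char v) { pix[y * w + x] = v; }
};

// Build the corrected image by working backwards: for every pixel of the
// corrected image, use the radial polynomial (coefficients a[j]) to find the
// corresponding location in the distorted image and copy the nearest pixel.
// (xc, yc) is the center of the corrected image and (xdc, ydc) the center of
// distortion measured in the distorted image.
Image undistort(const Image& distorted, const std::vector<double>& a,
                double xc, double yc, double xdc, double ydc) {
    Image out{ distorted.w, distorted.h,
               std::vector<unsigned char>(distorted.pix.size(), 0) };
    for (int y = 0; y < out.h; ++y) {
        for (int x = 0; x < out.w; ++x) {
            double r = std::hypot(x - xc, y - yc);       // radius in corrected space
            double theta = std::atan2(y - yc, x - xc);   // polar angle is unchanged
            double rd = 0.0, rpow = 1.0;                 // evaluate the distorted radius
            for (double aj : a) { rd += aj * rpow; rpow *= r; }
            int xd = static_cast<int>(std::lround(xdc + rd * std::cos(theta)));
            int yd = static_cast<int>(std::lround(ydc + rd * std::sin(theta)));
            if (xd >= 0 && xd < distorted.w && yd >= 0 && yd < distorted.h)
                out.set(x, y, distorted.at(xd, yd));
        }
    }
    return out;
}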
8.3.2 Removing Perspective Distortion
Figure 8.8. Perspective distortion in the view of the side of a building. If four points
can be identified that are known to be at the corners of a (known) rectangle (such
as a window) then they can be used to determine the elements of H . From that, a
perpendicular view of the side can be reconstructed (see Figure 8.9). The four points
that we will use to determine H are highlighted.
In terms of the elements hij of H, the mapped image coordinates are given by

$$\frac{X_m}{W_m} = \frac{h_{00}X + h_{01}Y + h_{02}}{h_{20}X + h_{21}Y + h_{22}}, \qquad (8.2)$$

$$\frac{Y_m}{W_m} = \frac{h_{10}X + h_{11}Y + h_{12}}{h_{20}X + h_{21}Y + h_{22}}. \qquad (8.3)$$
Figure 8.9. Perspective distortion removed by applying H to each pixel in Figure 8.8.
The points used to determine H are illustrated in their corrected positions. It should
be noted that the correction only applies to those parts of the building which lie in
the same plane. If a different plane is used for the correction, a different result is
observed (see Figure 8.10).
Figure 8.10. Perspective distortion using points in a different plane results in a differ-
ent correction.
Because H is defined only up to an overall scale, we can fix one of its elements, say h22 = 1. Thus for points (Xi, Yi), 0 ≤ i ≤ 3, and their matches (Xmi, Ymi), 0 ≤ i ≤ 3, we get the linear equations
$$\begin{bmatrix}
X_0 & Y_0 & 1 & 0 & 0 & 0 & -X_0 X_{m0} & -Y_0 X_{m0} \\
0 & 0 & 0 & X_0 & Y_0 & 1 & -X_0 Y_{m0} & -Y_0 Y_{m0} \\
X_1 & Y_1 & 1 & 0 & 0 & 0 & -X_1 X_{m1} & -Y_1 X_{m1} \\
0 & 0 & 0 & X_1 & Y_1 & 1 & -X_1 Y_{m1} & -Y_1 Y_{m1} \\
X_2 & Y_2 & 1 & 0 & 0 & 0 & -X_2 X_{m2} & -Y_2 X_{m2} \\
0 & 0 & 0 & X_2 & Y_2 & 1 & -X_2 Y_{m2} & -Y_2 Y_{m2} \\
X_3 & Y_3 & 1 & 0 & 0 & 0 & -X_3 X_{m3} & -Y_3 X_{m3} \\
0 & 0 & 0 & X_3 & Y_3 & 1 & -X_3 Y_{m3} & -Y_3 Y_{m3}
\end{bmatrix}
\begin{bmatrix} h_{00}\\ h_{01}\\ h_{02}\\ h_{10}\\ h_{11}\\ h_{12}\\ h_{20}\\ h_{21} \end{bmatrix}
=
\begin{bmatrix} X_{m0}\\ Y_{m0}\\ X_{m1}\\ Y_{m1}\\ X_{m2}\\ Y_{m2}\\ X_{m3}\\ Y_{m3} \end{bmatrix}. \tag{8.4}$$
This simple-looking (and first-guess) method to determine H , and with
it the ability to correct for perspective distortion, is flawed in practice because
we cannot determine the location of the four points with sufficient accuracy.
The discrete pixels used to record the image and the resulting numerical problems introduce quantization noise, which confounds the numerical solution
of Equation (8.4). To obtain a robust and accurate solution, it is necessary
to use many more point correspondences and apply one of the estimation
algorithms outlined in Section 8.3.4.
8.3.3
Figure 8.11. Matching image points on the projection plane to points in the 3D world.
separately and does not need to be included in the camera matrix. This
is the method we prefer to use.
All these parameters are encompassed in the 3 × 4 matrix we identified before. That is,
$$\begin{bmatrix} X \\ Y \\ W \end{bmatrix} = P \begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix},$$
or in full we can write
$$\begin{bmatrix} X \\ Y \\ W \end{bmatrix} = \begin{bmatrix} p_{00} & p_{01} & p_{02} & p_{03} \\ p_{10} & p_{11} & p_{12} & p_{13} \\ p_{20} & p_{21} & p_{22} & p_{23} \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix}.$$
If we multiply out the terms of this matrix and decide, as is normal, that
W = 1, then we have
X = p00 x + p01 y + p02 z + p03 w,
Y = p10 x + p11 y + p12 z + p13 w,
1 = p20 x + p21 y + p22 z + p23 w.
The third equation in this sequence is interesting because if its value needs to
be equal to 1 then the term p23 must be dependent on the values of p20 , p21
and p22 . This means that the complete camera matrix contains 11 independent values.
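To make the mapping concrete, here is a minimal sketch (ours, not one of the book's code listings) that pushes a homogeneous world point through a 3 × 4 camera matrix and performs the perspective divide; the types and function name are illustrative only.

// Project a homogeneous world point (x, y, z, w) through a 3x4 camera matrix P
// and return the pixel coordinates (X/W, Y/W).
struct Pixel { double X, Y; };

Pixel project(const double P[3][4], double x, double y, double z, double w = 1.0)
{
    double Xh = P[0][0]*x + P[0][1]*y + P[0][2]*z + P[0][3]*w;
    double Yh = P[1][0]*x + P[1][1]*y + P[1][2]*z + P[1][3]*w;
    double Wh = P[2][0]*x + P[2][1]*y + P[2][2]*z + P[2][3]*w;
    return { Xh / Wh, Yh / Wh };   // perspective divide gives image coordinates
}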
As we know already, every point correspondence affords two equations (one for X and one for Y). So 5½ point correspondences are needed between the real-world coordinates and the image coordinates in order to derive the camera matrix, rather than deriving it directly from camera calibration methods. If 5½ correspondences are used then the solution is exact.
The problem with this method is that if six or more correspondences are used, we don't find that some correspondences simply drop out of the calculation because they are linear combinations of others. What we find is that we obtain quite wide variations in the numerical coefficients of P depending on which 5½ are chosen for the calculation. This is due to pixel quantization
noise in the image and errors in matching the correspondences. Resolving the difficulty is a major problem in computer vision, one that is the subject of current and
ongoing research.
Intuitively, it would seem evident that if the effect of noise and errors are
to be minimized then the more points taken into account, the more accurately
the coefficients of the camera matrix could be determined. This is equally
true for the coefficients of a homography between planes in P2 and for the
techniques that use two and three views to recover 3D structure. A number
of very useful algorithms have been developed that can make the best use of
the point correspondences in determining P etc. These are generally referred
to as optimization algorithms, but many experts who specialize in computer
vision prefer the term estimation algorithms.
8.3.4
Estimation
So, the practicalities of determining the parameters of the camera matrix are
confounded by the presence of noise and error. This also holds true for perspective correction and all the other procedures that are used in computer
vision. It is a major topic in itself to devise, characterize and study algorithms
that work robustly. By robustly, we mean algorithms that are not too sensitive to changes in noise patterns or the number of points used and can make
an intelligent guess at what points are obviously in gross error and should be
ignored (outliers).
We will only consider here the briefest of outlines of a couple of algorithms that might be employed in these circumstances. The direct linear
transform (DLT) [7, pp. 88–93] algorithm is the simplest; it makes no attempt to account for outliers. The random sample consensus (RANSAC) [5]
and the least median squares (LMS) algorithms use statistical and iterative
procedures to identify and then ignore outliers.
Since all estimation algorithms are optimization algorithms, they attempt
to minimize a cost function. These tend to be geometric, statistical or algebraic. To illustrate the process, we consider a simple form of the DLT
algorithm to solve the problem of removing perspective distortion. (Using
four corresponding points (in image and projection), we obtained eight linear
inhomogeneous equations (see Equation (8.4)), from which the coefficients of H can then be estimated.)
If we can identify more than four matching point correspondences, the
question is then how to use them. This is where the DLT algorithm comes
in. To see how to apply it, consider the form of the matrix in Equation (8.4) written for a single point correspondence (xi, yi) ↔ (xi′, yi′):
$$\begin{bmatrix} x_i & y_i & 1 & 0 & 0 & 0 & -x_i x'_i & -y_i x'_i \\ 0 & 0 & 0 & x_i & y_i & 1 & -x_i y'_i & -y_i y'_i \end{bmatrix}
\begin{bmatrix} h_{00}\\ h_{01}\\ h_{02}\\ h_{10}\\ h_{11}\\ h_{12}\\ h_{20}\\ h_{21} \end{bmatrix}
= \begin{bmatrix} x'_i \\ y'_i \end{bmatrix}. \tag{8.5}$$
If the points are instead written in homogeneous coordinates, (xi, yi, wi) and (xi′, yi′, wi′), and all nine elements of H are retained, each correspondence yields
$$\begin{bmatrix} 0 & 0 & 0 & -w'_i x_i & -w'_i y_i & -w'_i w_i & y'_i x_i & y'_i y_i & y'_i w_i \\ w'_i x_i & w'_i y_i & w'_i w_i & 0 & 0 & 0 & -x'_i x_i & -x'_i y_i & -x'_i w_i \end{bmatrix}
\begin{bmatrix} h_{00}\\ h_{01}\\ h_{02}\\ h_{10}\\ h_{11}\\ h_{12}\\ h_{20}\\ h_{21}\\ h_{22} \end{bmatrix} = 0. \tag{8.6}$$
For n points in this homogeneous form, there is now a 2n × 9 matrix, but solutions of the form Ah = 0 are less subject to instabilities than the inhomogeneous form of Equation (8.5).
There are two steps in the basic direct linear transform algorithm. They
are summarized in Figure 8.12. The DLT algorithm is a great starting point
for obtaining a transformation that will correct for a perspective projection,
but it also lies at the heart of more sophisticated estimation algorithms for
solving a number of other computer vision problems.
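To make the first of those two steps concrete, the following sketch (ours, not from the book's listings) assembles the 2n × 9 matrix of Equation (8.6) for the common case wi = wi′ = 1 and takes h as the right singular vector belonging to the smallest singular value. We use the Eigen linear-algebra library purely as a convenient way to compute the SVD, and we omit the normalization pre-processing step noted in Figure 8.12.

#include <Eigen/Dense>
#include <vector>

struct Point2 { double x, y; };

// Estimate the homography H (3x3, scaled so that h22 = 1) from n >= 4 point
// correspondences src[i] <-> dst[i] using the basic (unnormalized) DLT.
Eigen::Matrix3d estimateHomographyDLT(const std::vector<Point2>& src,
                                      const std::vector<Point2>& dst)
{
    const int n = static_cast<int>(src.size());
    Eigen::MatrixXd A(2 * n, 9);
    for (int i = 0; i < n; ++i) {
        const double x = src[i].x, y = src[i].y;      // w_i  = 1
        const double xp = dst[i].x, yp = dst[i].y;    // w'_i = 1
        A.row(2 * i)     << 0, 0, 0, -x, -y, -1,  yp * x,  yp * y,  yp;
        A.row(2 * i + 1) << x, y, 1,  0,  0,  0, -xp * x, -xp * y, -xp;
    }
    // h is the null vector of A: the right singular vector belonging to the
    // smallest singular value.
    Eigen::JacobiSVD<Eigen::MatrixXd> svd(A, Eigen::ComputeFullV);
    Eigen::VectorXd h = svd.matrixV().col(8);
    Eigen::Matrix3d H;
    H << h(0), h(1), h(2),
         h(3), h(4), h(5),
         h(6), h(7), h(8);
    return H / H(2, 2); // fix the overall scale so that h22 = 1
}

Exactly the same structure, with a 2n × 12 matrix, applies to the camera-matrix estimation described next.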
We have seen that the properties of any camera can be concisely specified
by the camera matrix. So we ask, is it possible to obtain the camera matrix
itself by examining correspondences from points (not necessarily lying on a
plane) in 3D space to points on the image plane? That is, we are examining
correspondences from P3 to P2 .
Of course the answer is yes. In the case we have just discussed, H was a
3 × 3 matrix whereas P is dimensioned 3 × 4. However, it is inevitable that
the estimation technique will have to be used to obtain the coefficients of P.
Furthermore, the only differences between the expressions for the coefficients
of H (Equation (8.6)) and those of P arise because the points in P3 have four
coordinates, and as such we will have a matrix of size 2n × 12. Thus the analog of Equation (8.6) can be expressed as Ap = 0, and the requirement for the estimation to work is n ≥ 6. (That is, six corresponding pairs with 12
coefficients and 11 degrees of freedom.)
Figure 8.12. Obtaining the matrix H using the DLT algorithm with n point correspondences. In practice, the DLT algorithm goes through a pre-processing step
of normalization, which helps to reduce numerical problems in obtaining the SVD
(see [7, pp. 88–93]).
Writing the world points as xiᵀ = [xi, yi, zi, wi] and their images as (Xi, Yi, Wi), each correspondence contributes
$$\begin{bmatrix} W_i\,\mathbf{x}_i^T & \mathbf{0}^T & -X_i\,\mathbf{x}_i^T \\ \mathbf{0}^T & W_i\,\mathbf{x}_i^T & -Y_i\,\mathbf{x}_i^T \end{bmatrix}
\begin{bmatrix} \mathbf{p}^0 \\ \mathbf{p}^1 \\ \mathbf{p}^2 \end{bmatrix} = 0, \tag{8.7}$$
where p0, p1 and p2 are the three rows of P written as column vectors.
8.3.5
Figure 8.13. (a)–(f).
simple assumptions regarding vanishing points and the line at infinity have
to be accepted.) Figure 8.13(c) shows image maps recovered from the picture and corrected for perspective. Figure 8.13(d) is a different view of the
mesh model. Figure 8.13(e) is a 3D rendering of the mesh model with image
maps applied to surfaces. Figure 8.13(f ) shows the scene rendered from a
low-down viewpoint. (Note: some minor user modifications were required to
fix the mesh and maps after acquisition.) The details of the algorithms that
achieve these results are simply variants of the DLT algorithm we examined
in Section 8.3.4.
In an extension to this theory, it is also possible to match up several adjoining images taken by rotating a camera, as illustrated in Figure 8.14. For
example, the three images in Figure 8.15 can be combined to give the single
image shown in Figure 8.16. This is particularly useful in situations where
you have a wide display area and need to collate images to give a panoramic
view (see Figure 8.17). A simple four-step algorithm that will assemble a
single panoramic composite image from a number of images is given in Figure 8.18.
Figure 8.14. A homography maps the image planes of each photograph to a reference
frame.
Figure 8.15. Three single photos used to form a mosaic for part of a room environ-
ment.
Figure 8.18. The algorithm to assemble a panoramic mosaic from a number of individual images.
8.3.6
So far, we have really only considered scene reconstruction using one camera.
However, when a second, third or fourth camera becomes available to take
simultaneous pictures of the 3D world, a whole range of exciting opportunities open up in VR. In particular, the realm of 3D reconstruction is very
important.
Again, the theory of scene reconstruction from multiple cameras is based
on rigorous maths. We will only briefly explore these ideas by introducing
epipolar geometry. This arises as a convenient way to describe the geometric
relationship between the projective views that occur between two cameras
looking at a scene in the 3D world from slightly different viewpoints. If the
geometric relationship between the cameras is known, one can get immediate
access to 3D information in the scene. If the geometric relationship between
the cameras is unknown, the images of test patterns allow the camera matrices
to be obtained from just the two images alone.
All of the information that defines the relationship between the cameras and allows for a vast amount of detail to be retrieved about the scene and the cameras (e.g., 3D features if the relative positions and orientations of the cameras are known) is expressed in a 3 × 3 matrix called the fundamental matrix. Another 3 × 3 matrix called the essential matrix may be used at times when one is dealing with fully calibrated cameras. The idea of epipolar geometry is illustrated in Figure 8.19(a), where the two camera centers C and C′ and the camera planes are illustrated.
Any point x in the world space projects onto points X and X′ in the image planes. The points x, C and C′ define a plane S. The points where the line joining the camera centers intersects the image planes are called the epipoles. The lines of intersection between S and the image planes are called the epipolar lines, and it is fairly obvious that for different points x there will be different epipolar lines, although the epipoles depend only on the camera centers and the location and orientation of the camera planes (see Figure 8.19(b)).
Figure 8.19. (a) Point correspondences and epipolar relationships. (b) Variations of
epipolar lines. Point x1 gives rise to epipolar lines l1 and l1′, point x2 to l2 and l2′, etc.
Figure 8.20 shows how the epipolar lines fan out from the epipoles as x
moves around in world space; this is called the epipolar pencil.
Now, what we are looking for is a way to represent (parameterize) this
two-camera geometry so that information can be used in a number of interesting ways. For example, we should be able to carry out range finding if we
know the camera geometry and the points X and X′ on the image planes, or if
we know something about the world point, we can determine the geometric
relationship of the cameras.
Figure 8.20. Epipole and epipolar lines as they appear in the camera plane view.
In Section 8.3.3, all the important parametric details about a camera were summed up in the elements of the 3 × 4 matrix P. It turns out that an exact analog exists for epipolar geometry; it is called the fundamental matrix F. Analogous to the way P provided a mechanism for mapping points in 3D space to points on the camera's image plane, F provides a mapping, not from P3 to P2, but from one camera's image of the world point x to the epipolar line lying in the image plane of the other camera (see Figure 8.21).
This immediately gives a clue as to how to use a two-camera set-up to
achieve range finding of points in world space if we know F :
Figure 8.21. A point projected onto the image plane of camera 1 is mapped to some
point lying on the epipolar line in the image plane of camera 2 by the fundamental
matrix F .
Take the projected image of a scene with two cameras. Use image
processing software to identify key points in the picture from image 1. Apply F to these points in turn to find the epipolar lines
in image 2 for each of these points. Now use the image processing software to search along the epipolar lines and identify the
same point. With the location in both projections known and
the locations and orientations of the cameras known, it is simply down to a bit of coordinate geometry to obtain the world
coordinate of the key points.
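As a small illustration of part of that recipe, here is a sketch (ours; the type and function names are illustrative only) that computes the epipolar line l′ = F x in image 2 for a point in image 1 and measures how far a candidate match lies from it.

#include <cmath>

// Homogeneous 2D line (a, b, c) satisfying aX + bY + cW = 0.
struct Line { double a, b, c; };

// Epipolar line in image 2 of a point (X, Y) in image 1: l' = F x.
Line epipolarLine(const double F[3][3], double X, double Y)
{
    return { F[0][0]*X + F[0][1]*Y + F[0][2],
             F[1][0]*X + F[1][1]*Y + F[1][2],
             F[2][0]*X + F[2][1]*Y + F[2][2] };
}

// Perpendicular distance (in pixels) of a candidate point in image 2 from the
// epipolar line; a true match should lie (nearly) on the line.
double distanceToLine(const Line& l, double X, double Y)
{
    return std::fabs(l.a * X + l.b * Y + l.c) / std::sqrt(l.a * l.a + l.b * l.b);
}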
But there is more: if we can range find, we can do some very interesting
things in a virtual world. For example, we can have virtual elements in a
scene appear to interact with real elements in a scene viewed from two video
cameras. The virtual objects can appear to move in front of or behind the real
objects.
Let's return to the question of how to specify and determine F and how to apply it. As usual, we need to do a bit of mathematics, but before we do, let us recap some vector rules that we will need to apply:
Rule 1: The cross product of a vector with itself is equal to zero. For example, X3 × X3 = 0.
Rule 2: The dot product of a vector with itself can be rewritten as the transpose of the vector multiplied by itself. For example, X3 · X3 = X3ᵀ X3.
Rule 3: Extending Rule 1, we can write X3 · (X3 × X3) = 0.
Rule 4: This is not so much a rule as an observation. The cross product of a vector with another vector that has been transformed by a matrix can be rewritten as the second vector transformed by a different matrix. For example, X3 × (T X3) = M X3.
Remembering these four rules and using Figure 8.22 as a guide, let X3 = [x, y, z, w]ᵀ represent the coordinates in P3 of the image of a point in the first camera. Using the same notation, the image of that same point in the second camera is at X3′ = [x′, y′, z′, w′]ᵀ, also in world coordinates. These positions are with reference to the world coordinate systems of each camera. After projection onto the image planes, these points are at X = [X, Y, W]ᵀ and X′ = [X′, Y′, W′]ᵀ, respectively, and these P2 coordinates are again relative
to their respective projection planes. (Note: X3 and X represent the same
point, but X3 is relative to the 3D world coordinate system with its origin at
C, whereas X is a 2D coordinate relative to an origin at the bottom left corner
of the image plane for camera 1.)
Figure 8.22. Two-camera geometry; the second camera is positioned relative to the first.
Now, assuming that the properties of the cameras are exactly the same,
apply the necessary translation and rotation to the first camera so that it lies
at the same location and points in the same direction as the second camera.
Then X3 will have been moved to X3′, and thus
X3′ = T X3,
where T is a transformation matrix that represents the rotation and translation of the frame of origin of camera 1 to that of camera 2.
Now if we consider Rule 3, which for X3′ states
X3′ · (X3′ × X3′) = 0,
then we can replace one of the X3′ terms with T X3. That is,
X3′ · (X3′ × T X3) = 0.
Utilizing Rule 2, we can rewrite this equation as
X3′ᵀ (X3′ × T X3) = 0,
and then, applying Rule 4 to the term in brackets, as
X3′ᵀ M X3 = 0,   (8.8)
where the matrix M is derived from the translation matrix. Using the camera calibration matrices P and P′ to project the world points X3 and X3′ onto the viewing planes, X = P X3 and X′ = P′ X3′, Equation (8.8) becomes
(P′⁻¹ X′)ᵀ M (P⁻¹ X) = 0,
or
X′ᵀ ((P′⁻¹)ᵀ M P⁻¹) X = 0.   (8.9)
The matrix F = (P′⁻¹)ᵀ M P⁻¹ is the fundamental matrix, and writing its coefficients in full for a pair of corresponding image points (Xi, Yi, Wi) and (Xi′, Yi′, Wi′), Equation (8.9) becomes
$$\begin{bmatrix} X'_i & Y'_i & W'_i \end{bmatrix}
\begin{bmatrix} f_{00} & f_{01} & f_{02} \\ f_{10} & f_{11} & f_{12} \\ f_{20} & f_{21} & f_{22} \end{bmatrix}
\begin{bmatrix} X_i \\ Y_i \\ W_i \end{bmatrix} = 0.$$
Like the P2 to P2 homography H, the fundamental matrix has only eight independent coefficients.
There are a large number of other interesting properties that emerge from
a deeper analysis of Equation (8.9), but for the purposes of this introductory
chapter and in the context of VR, we simply refer you again to [3] and [7] for
full details.
8.3.7
Section 8.3.6 hinted that by using the concept of epipolar geometry, two pictures taken from two different viewpoints of a 3D world can reveal a lot more
about that world than we would obtain from a single picture. After all, we
have already discussed that for VR work, stereopsis is an important concept,
and stereoscopic projection and head-mounted displays are an essential component of a high quality VR environment. Nevertheless, one is still entitled
to ask if there are any advantages in having an even greater number of views
of the world. Do three, four or indeed n views justify the undoubted mathematical complexity that will result as one tries to make sense of the data and
analyze it?
For some problems, the answer is of course yes, but the advantage is less
pronounced from our viewpoint, i.e., in VR. We have two eyes, so the step in
going from a single view to two views is simply bringing the computer-based
virtual world closer to our own sense of the real world. To some extent, we
are not particularly interested in the aspect of computer vision that is more
closely allied to the subject of image recognition and artificial intelligence,
where these multi-views do help the stupid computer recognize things better
and reach more human-like conclusions when it looks at something.
For this reason, we are not going to delve into the analysis of three-view
or n-view configurations. It is just as well, because when three views are
considered, matrices (F for example) are no longer sufficient to describe the
geometry, and one has to enter the realm of the tensor and its algebra and
analysis. Whilst tensor theory is not an impenetrable branch of mathematics,
it is somewhat outside the scope of our book on VR. In terms of computer
vision, when thinking about three views, the important tensor is called the
trifocal tensor. One can read about it from those who proposed and developed
it [6, 13].
8.3.8
If you think about it for a moment, making a video in which the camera
moves from one spot to another effectively gives us at least two views of a
scene. Of course, we get a lot more than two views, because even if the
journey takes as short a time as one second, we may have 25 to 30 pictures,
one from each frame of the movie. So on the face of it, taking video or
moving footage of a scene should be much more helpful in extracting scene
detail than just a couple of camera snapshots.
Unfortunately, there is one small complication about using a single camera to capture scene detail (and we are not talking about the fact that we don't move the camera, or that we simply pan it across the scene, which is no better). We are talking about the fact that elements in the scene get the chance
to move. Moving elements will make the recognition of points much more
difficult. For example, in using the fundamental matrix to calculate range, we
cannot be assured that if we take two pictures of a scene and select a point
of interest in one picture (use F to find its epipolar line in the second), the
image of the point in the second picture will still be lying on the epipolar line.
Figure 8.23. Three images taken from a video clip are used by an SfM (structure-from-motion) process to recover the structure of the scene.
8.4
Software Libraries
for Computer Vision
Figure 8.24. A distinctive marker in the field of view of the camera provides a frame of
reference that is used to place virtual objects in the scene with a size and orientation
calculated to make them look like they are attached to the marker.
8.5
Summary
In this chapter, we have given you a quick run-through of some of the concepts of computer vision and how they might be applied to enhance and enrich a VR application.
Bibliography
[1] The Augmented Reality Toolkit. http://www.hitl.washington.edu/artoolkit/.
[2] M. Billinghurst and H. Kato. Real World Teleconferencing. In CHI '99 Extended Abstracts on Human Factors in Computing Systems, pp. 194–195. New York: ACM Press, 1999.
[3] O. Faugeras. Three-Dimensional Computer Vision: A Geometric Viewpoint.
Cambridge, MA: The MIT Press, 1993.
[4] O. Faugeras and T. Papadopoulo. A Nonlinear Method for Estimating the
Projective Geometry of 3 Views. In Proceedings of the Sixth International Conference on Computer Vision, pp. 477–484. Washington, D.C.: IEEE Computer
Society, 1998.
[5] M. Fischler and R. Bolles. Random Sample Consensus: A Paradigm for Model
Fitting with Applications to Image Analysis and Automated Cartography.
Communications of the ACM 24:6 (1981) 381–395.
9
Image-Based Rendering
Image-based rendering (IBR) techniques are a powerful alternative to traditional geometry-based techniques for creating images. Instead of using basic
geometric primitives, a collection of sample images are used to render novel
views.
In Chapter 5, we discussed these traditional techniques and looked
at how 3D scenes constructed from polygonal models were stored in databases. Then, in Chapter 7, which explored the 3D rendering pipeline, we
made the implicit assumption that the view one sees in a 3D scene arises as
the result of rendering these polygonal models. And for the majority of VR
applications, this assumption is probably true. However, IBR offers us an
alternative way to set up and visualize a virtual world that does not involve
rendering polygons and doesn't require any special 3D-accelerated hardware.
Fortunately, we have already touched upon the techniques of IBR in Section 8.3.5 when we discussed panoramic image composition, and we also
covered one of its most fundamental ideas (that is, removing perspective distortion) in Section 8.3.2.
Of course, there is a lot more to IBR than just allowing one to render
views of a virtual world made up of a collection of photographs as opposed
to a collection of polygonal models. In this chapter, we will allow ourselves a
little indulgence to briefly explore some of these exciting ideas before focusing
primarily on those aspects of IBR most relevant to VR. In some situations,
IBR is a much more relevant technology to use for VR because it actually
allows us to use photographs of real places, and this can make a virtual scene
seem anything but virtual!
9.1
Figure 9.2. The viewing frustum for rendering from an environment map. In (a), a
cubic environment is shown, and in (b) a spherical one. Note the distortion on the
spherical map and the requirement to use parts of two or three sides of the cube.
method of capturing image panoramas has been used to provide environment maps for a number of the sample programs in Direct3D SDK
(which we discuss in Chapter 13). Both these methods render their
output by working from a panoramic image stitched together from a
collection of photographs. Rendering can be done by traditional spherical or cubic environment maps (see Figure 9.2), which can take advantage of hardware acceleration if its available, or by a highly optimized
image texture lookup function using direction of view input parameters. More interesting perhaps is how the panoramas are produced. We
will look at these methods in more detail shortly.
2. Rendering with implicit geometry. This class of methods gets its name
because it uses inferred geometric properties of the scene which arise
Figure 9.3. Viewmorph. Original images I0 and I1 are projected on a plane at I0′ and I1′ respectively. Both these images lie in the same plane, and an interpolation between them will give an image Ii′. If required, a reprojection to the image plane Ii can be made.
Figure 9.4. Depth rendering. When the distance from the camera to an object's surface is known, a 3D model can be constructed. This can be rendered from a different direction, but note that points on the object's surface which are not visible from one camera location may be visible from another, and since they can't be seen from the location where the range is detected, holes will result in the alternate view.
In VR, the most useful IBR methods are the 2D no geometry plenoptic functions of item 1. In the rest of this chapter, we will look at how the
panoramic images they use are produced and applied in real-time virtual environment navigation applications.
9.2
One of the trickiest things to do in IBR is acquire the images that are to be
used to construct the panoramic view or omnidirectional object views rendered in response to a specified direction of view. Ultimately, one would
really like to be able to construct these by using a handheld video camera and
making a rough scan of the complete enveloping environment. This would
require significant analytical effort in the subsequent image processing stage,
termed mosaicing. Alternatively, special-purpose imaging hardware has been
developed, such as that by Matusik et al. [11]. They use this hardware to
acquire images of objects under highly controlled lighting conditions. Equipment of this type is illustrated in Figure 9.5(a).
A camera mounted on a tripod can be used to obtain 360° panoramas
(the concept of which is illustrated in Figure 8.14). However, problems arise
if one tries to capture images directly overhead. In this case, the production
of the complete panorama can be challenging. To increase the vertical field of
view, a camera can be mounted sideways. This was suggested by Apple Computer as one way to prepare a panoramic movie for use in QuickTime VR
Figure 9.5. (a) Scanning a model with multiple cameras and light sources, after [11].
(b) Panoramic camera mounted on its side for wide vertical field of view. (c) The
Omnicam with parabolic mirror. (d) A Catadioptric camera derived from the Omnicam. (Photograph of the Omnicam and Catadioptric camera courtesy of Professor
Nayar of Columbia University.)
(see Figure 9.5(b)). QuickTime VR can also be used in object movie mode, in
which we look inward at an object rather than outward towards the environment. To make a QuickTime VR object movie, the camera is moved in an
equatorial orbit round a fixed object whilst taking pictures every 10°. Several
more orbits are made at different vertical heights.
Many other ingenious methods have been developed for making panoramic images. In the days before computers, cameras that used long film strips
would capture panoramic scenes [14]. Fish-eye lenses also allow wide-angle
panoramas to be obtained. Mirrored pyramids and parabolic mirrors can also
capture 360° views, but these may need some significant image processing to remove distortion. 360° viewing systems are often used for security surveillance, video conferencing (e.g., Be Here [2]) and devices such as the Omnicam (see Figure 9.5(c)) and other catadioptric cameras [6] from Columbia, or
that used by Huang and Trivedi for research projects in view estimation in
automobiles [9].
9.3
Once acquired, images have to be combined into a large mosaic if they are
going to be used as a panoramic source for IBR. If the images are taken from a
Figure 9.6. (a) Input images; (b) combined images; (c) transform in two steps. Image 1
is being blended with image 2. Under the transform defined by the quadrilaterals,
point p in image 1 is moved to point p′ in the coordinate system used by image 2.
Other points map according to the same transformation.
set produced by camera rotation, the term stitching may be more appropriate,
as sequential images usually overlap in a narrow strip. This is not a trivial task,
and ideally one would like to make the process fully automatic if possible.
If we can assume the images are taken from the same location and with
the same camera then a panorama can be formed by reprojecting adjacent
images and then mixing them together where they overlap. We have already
covered the principle of the idea in Section 8.3.2; the procedure is to manually
identify a quadrilateral that appears in both images. Figure 9.6(a) illustrates
images from two panoramic slices, one above the other, and Figure 9.6(b)
shows the stitched result. The quadrilaterals have been manually identified.
This technique avoids having to directly determine an entire camera model,
and the equation which describes the transformation (Equation (8.4)) can
be simplified considerably in the case of quadrilaterals. As shown in Figure 9.6(c), the transformation could be done in two steps: a transform from
the input rectangle to a unit square in (u, v) space is followed by a transform
from the unit square in (u, v) to the output quadrilateral. This has the advantage of being faster, since Equation (8.4) when applied to step 2 can be solved
symbolically for the hij . Once the hij have been determined, any pixel p in
image 1 can be placed in its correct position relative to image 2, at p′.
Of course, the distortion of the mapping means that we shouldn't just do a naive copy of the pixel value from p to p′. It will be necessary to carry out some
form of anti-alias filtering, such as a simple bilinear interpolation from the
neighboring pixels to p in image 1 or the more elaborate elliptical Gaussian
filter of Heckbert [8].
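As a sketch of that resampling (ours, with hypothetical types and names), the following maps a pixel position through a 3 × 3 homography and fetches a bilinearly interpolated value from the four neighbouring pixels.

#include <cmath>
#include <vector>

struct Gray {
    int width, height;
    std::vector<float> v;  // row-major samples
    float at(int x, int y) const { return v[y * width + x]; }
};

// Apply a 3x3 homography to pixel coordinates (x, y), with homogeneous divide.
void applyH(const double H[3][3], double x, double y, double& xo, double& yo)
{
    double w = H[2][0]*x + H[2][1]*y + H[2][2];
    xo = (H[0][0]*x + H[0][1]*y + H[0][2]) / w;
    yo = (H[1][0]*x + H[1][1]*y + H[1][2]) / w;
}

// Bilinear sample of an image at a non-integer position: a simple anti-alias
// filter built from the four neighbouring pixels.
float bilinear(const Gray& img, double x, double y)
{
    int x0 = (int)std::floor(x), y0 = (int)std::floor(y);
    if (x0 < 0 || y0 < 0 || x0 + 1 >= img.width || y0 + 1 >= img.height) return 0.0f;
    double fx = x - x0, fy = y - y0;
    return (float)((1-fx)*(1-fy)*img.at(x0, y0)   + fx*(1-fy)*img.at(x0+1, y0) +
                   (1-fx)*fy    *img.at(x0, y0+1) + fx*fy    *img.at(x0+1, y0+1));
}

In practice one iterates over the pixels of the growing mosaic, applies the inverse mapping to land back in image 1 and samples there with bilinear(), so that no holes appear in the output.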
Manually determining the homogeneous mapping between images being
mosaiced is fine for a small number of images, but if we want to combine a
Figure 9.7. The planar mosaic on the right is produced by automatically overlapping
and blending a sequence of images. Two of the input images are shown on the left.
over all corresponding pixels i. Once the hij coefficients have been determined, the images are blended together using a bilinear weighting function
that is stronger near the center of the image. The minimization algorithm
uses the Levenberg-Marquardt method [15]; a full description can be found
in [19]. Once the relationship between the two images in the panorama has
been established, the next image is loaded and the comparison procedure begins again.
Szeliski and Shum [20] have refined this idea to produce full-view panoramic image mosaics and environment maps obtained using a handheld digital camera. Their method also resolves the stitching problem in a cylindrical
panorama where the last image fails to align properly with the first. Their
method will still work so long as there is no strong motion parallax (translational movement of the camera) between the images. It achieves a full view1
by storing the mosaic not as a single stitched image (for example, as depicted
in Figure 9.6), but as a collection of images each with their associated positioning matrices: translation, rotation and scaling. It is then up to the viewing
software to generate the appropriate view, possibly using traditional texture
mapping onto a polygonal surface surrounding the viewpoint, such as a subdivided dodecahedron or tessellated globe as depicted in Figure 9.8. For each
triangular polygon in the sphere, any panoramic image that would overlap
it in projection from the center of the sphere is blended together to form a
composite texture map for that polygon. Texture coordinates are generated
for the vertices by projecting the vertices onto the texture map.
1
The method avoids the usual problem one encounters when rendering from panoramic
images, namely that of rendering a view which looks directly up or directly down.
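One plausible way to generate such texture coordinates, assuming an equirectangular (longitude-latitude) panoramic map rather than the collection-of-images representation of [20], is sketched below; it is our own illustration, not the method described in that paper.

#include <cmath>

// Map a unit direction vector (a sphere vertex seen from the viewpoint at the
// centre) to (u, v) texture coordinates in an equirectangular panoramic image.
void directionToUV(double x, double y, double z, double& u, double& v)
{
    const double pi = 3.14159265358979323846;
    u = 0.5 + std::atan2(y, x) / (2.0 * pi); // longitude, wraps around [0, 1)
    v = std::acos(z) / pi;                   // latitude, 0 at the top, 1 at the bottom
}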
Figure 9.8. Mapping onto polygonal spherical shapes allows a full-view panoramic mosaic to be rendered.
Figure 9.9. Building up a panoramic mosaic by moving the next image In into po-
sition via a rotation about the optical center. Once in position, In is blended with
the growing mosaic Ic . A least squares minimization algorithm which matches the
overlap between In and Ic is used to determine the optimal rotation matrix Rn .
9.3.1
QuickTime VR
A QuickTime movie is organized as a collection of tracks, each holding a type of linear media indexed with time, video and audio being the most well
known. A track can contain compressed data that requires its own player. The
QuickTime VR (QTVR) standard adds two new types of track. The first type
of track provides an inward-looking (or object) view which allows an object
to be looked at from any orientation. The second track holds a collection
of outward-looking environment panorama views. Appropriate interactive
software gives the user the impression of control over the view and viewpoint.
In the object format, the movie looks conventional, with each frame containing a different view of the photographed object. It is the duty of the player
to find and display the correct frame in the movie given the required viewing
direction. Extra frames can be added to allow for animation of the object.
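A sketch (ours, based on the 10° spacing described above and a hypothetical row-major frame layout) of how a player might pick the frame for a requested viewing direction:

#include <cmath>

// Choose the frame index for an object movie laid out as 'tiltRows' equatorial
// orbits, each containing 36 pictures taken every 10 degrees of pan.
int objectMovieFrame(double panDeg, double tiltDeg,
                     int tiltRows, double tiltMinDeg, double tiltStepDeg)
{
    int pan = (int)std::lround(panDeg / 10.0) % 36;
    if (pan < 0) pan += 36;                       // wrap negative pan angles
    int row = (int)std::lround((tiltDeg - tiltMinDeg) / tiltStepDeg);
    if (row < 0) row = 0;
    if (row >= tiltRows) row = tiltRows - 1;      // clamp to the available orbits
    return row * 36 + pan;
}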
In the panoramic format (see Figure 9.10), a number of panoramic images
are stored in a conventional movie track. These views give a 360 view from
several locations; it could be in different rooms of a virtual house or the same
view at different times. The player can pan around these and zoom in and
out. To allow the user to select a specific image, possibly by clicking on a
hotspot in the image, additional data can be added to the tracks. For example,
the panoramic track may contain data called nodes which correspond to points
in space. The nodes identify which panoramic image is to be used at that
point in space and identify how other panoramic images are selected as the
user moves from node to node. Hotspot images may also be mixed in to
give the appearance of being attached to objects in the panoramic view, say to
provide some information about an object, for example exhibits in a virtual
museum.
The panoramic images, which tend to be rather large, are usually divided
into smaller sub-images that are nearer the size of conventional video frames.
When playing the QTVR movie, there is no need for the frames to be stored
in any particular order. On playback, only those sub-images that lie in the
field of view need to be loaded from the file, so little RAM storage is usually
needed to interact with a QTVR movie. In some players, image caching may
be used to aid accelerated display. The final stage of rendering is accomplished
by a patented cylindrical to planar image warp [4] which renders the view
onto the flat screen.
QTVR is not the only panoramic image viewer, but it is the only one
tightly integrated into a major operating system. Other panoramic VR viewing and image-preparation software packages are relatively common and commercially available. Some have been particularly designed for use on the Web. As
mentioned before, realtor websites abound with virtual tours of their property
portfolios.
9.4
Summary
In this chapter, we looked at the three classical image-based rendering techniques: rendering with no geometry, rendering with implicit geometry and
finally rendering with explicit geometry. However, regardless of which technique is used, the main advantage that IBR has to offer over traditional rendering techniques based on polygonal models is its ability to render a scene
or object in great detail. The fact that the rendering time is independent of
the complexity of the scene is another significant plus. Before 3D-accelerated
graphics processors became commonplace, IBR was the only way to achieve
high-quality real-time graphics. Parallax effects, virtual tours and varying
lighting conditions can all now be simulated with IBR. This has great utility
in many different types of applications, one of the most common being archaeological walkthroughs where it is imperative to maintain the detail of the
real environment. Here, many people will get to experience a virtual archaeological site, where only a limited number of people would have access to the
real site.
The only real negative in IBR is the loss of flexibility in choosing viewing
conditions. Where IBR really excels is in producing environmental backgrounds and maps for use as part of a scene described by polygons and rendered in real time with a GPU. So, even if QuickTime VR fades from use,
the production of panoramic image maps will still be an essential element of
VR work, and as such, IBR will retain its importance to the VR community.
Bibliography
[1] E. Adelson and J. Bergman. The Plenoptic Function and the Elements of Early Vision. In Computational Models of Visual Processing, M. Landy and J. Movshon (editors), pp. 3–20. Cambridge, MA: The MIT Press, 1991.
[2] Be Here Corp. TotalView. http://www.behere.com/.
[3] S. Chen. QuickTime VR: An Image-Based Approach to Virtual Environment Navigation. In Proceedings of SIGGRAPH 95, Computer Graphics Proceedings, Annual Conference Series, edited by R. Cook, pp. 29–38. Reading, MA: Addison Wesley, 1995.
[4] S. Chen and G. Miller. Cylindrical to Planar Image Mapping Using Scanline Coherence. United States Patent no. 5,396,583. 1995.
[5] S. Chen and L. Williams. View Interpolation for Image Synthesis. In Proceedings of SIGGRAPH 93, Computer Graphics Proceedings, Annual Conference Series, edited by J. Kajiya, pp. 279–288. New York: ACM Press, 1993.
[6] Columbia University, Computer Vision Laboratory. Applications of 360 Degree Cameras. http://www1.cs.columbia.edu/CAVE//projects/app cam 360/app cam 360.php.
[7] S. Gortler et al. The Lumigraph. In Proceedings of SIGGRAPH 96, Computer Graphics Proceedings, Annual Conference Series, edited by H. Rushmeier, pp. 43–54. Reading, MA: Addison Wesley, 1996.
[8] P. Heckbert. Filtering by Repeated Integration. Computer Graphics 20:4 (1986) 317–321.
[9] K. Huang and M. Trivedi. Driver Head Pose and View Estimation with Single Omnidirectional Video Stream. In Proceedings of the 1st International Workshop on In-Vehicle Cognitive Computer Vision Systems, pp. 44–51. Washington, D.C.: IEEE Computer Society, 2003.
[10] C. Kuglin and D. Hine. The Phase Correlation Image Alignment Method. In Proceedings of the IEEE Conference on Cybernetics and Society, pp. 163–165. Washington, D.C.: IEEE Computer Society, 1975.
[11] W. Matusik et al. Image-Based 3D Photography Using Opacity Hulls. Proc. SIGGRAPH '02, Transactions on Graphics 21:3 (2002) 427–437.
[12] L. McMillan. An Image-Based Approach to Three-Dimensional Computer Graphics. Ph.D. Thesis, University of North Carolina at Chapel Hill, TR97-013, 1997.
[13] L. McMillan and G. Bishop. Plenoptic Modelling: An Image-Based Rendering System. In Proceedings of SIGGRAPH 95, Computer Graphics Proceedings, Annual Conference Series, edited by R. Cook, pp. 39–46. Reading, MA: Addison Wesley, 1995.
[14] J. Meehan. Panoramic Photography. New York: Amphoto, 1990.
[15] W. Press et al. (editors). Numerical Recipes in C++: The Art of Scientific Computing, Second Edition. Cambridge, UK: Cambridge University Press, 2002.
[16] S. Seitz and C. Dyer. View Morphing. In Proceedings of SIGGRAPH 96, Computer Graphics Proceedings, Annual Conference Series, edited by H. Rushmeier, pp. 21–30. Reading, MA: Addison Wesley, 1996.
[17] H. Shum and S. Kang. A Review of Image-Based Rendering Techniques. In Proceedings of IEEE/SPIE Visual Communications and Image Processing (VCIP) 2000, pp. 2–13. Washington, D.C.: IEEE Computer Society, 2000.
[18] H. Shum and R. Szeliski. Panoramic Image Mosaics. Technical Report MSR-TR-97-23, Microsoft Research, 1997.
[19] R. Szeliski. Image Mosaicing for Tele-Reality Applications. Technical Report CRL 94/2, Digital Equipment Corp., 1994.
[20] R. Szeliski et al. Creating Full View Panoramic Image Mosaics and Texture-Mapped Models. In Proceedings of SIGGRAPH 97, Computer Graphics Proceedings, Annual Conference Series, edited by T. Whitted, pp. 251–258. Reading, MA: Addison Wesley, 1997.
10
Stereopsis
One of the main things that sets a VR environment apart from watching TV
is stereopsis, or depth perception. Some 50 years ago, the movie business
dabbled with stereopsis, but it has been only in the last few years that it is
making a comeback with the advent of electronic projection systems. Stereopsis adds that wow factor to the viewing experience and is a must for any VR
system.
Artists, designers, scientists and engineers all use depth perception in their
craft. They also use computers: computer-aided design (CAD) software has
given them color displays, real-time 3D graphics and a myriad of human
interface devices, such as the mouse, joysticks etc. So the next logical thing
for CAD programs to offer their users is the ability to draw stereoscopically
as well as in 3D. In order to do this, one has to find some way of feeding
separate video or images to the viewers left and right eyes.
Drawing an analogy with stereophonics, where separate speakers or headphones play different sounds into your left and right ears, stereopsis must
show different pictures to your left and right eyes. However, stereopsis cannot
be achieved by putting two screens in front of the viewer and expecting her to
focus one eye on one and the other eye on the other. It just doesn't work. You
can of course mount a device on the viewers head (a head-mounted display
or HMD) that shows little pictures to each eye.
Technological miniaturization allows HMDs to be almost as light and
small as a rather chunky pair of sunglasses. Figure 10.1 illustrates two commercially available models that can be used for stereoscopic viewing or as part
of very personal video/DVD players. However, if you want to look at a computer
Figure 10.1. Two examples of typical head-mounted display devices for video, VR
and stereoscopic work. Small video cameras can be built into the eyewear (as seen
on right) so that when used with motion and orientation sensors, the whole package
delivers a sense of being immersed in an interactive virtual environment.
10.1
Parallax
There are many things we see around us that give clues as to how far away
something is or whether one thing is nearer or further away than something
else. Not all of these clues require us to have two eyes. Light and shade, interposition, texture gradients and perspective are all examples of monocular depth
cues, all of which we detailed in Section 2.1.2. Another monocular depth cue
is called motion parallax. Motion parallax is the effect we've all seen: if you close one eye and move your head from side to side, objects closer to you appear
to move faster and further than objects that are behind them. Interestingly,
all the monocular cues with the exception of motion parallax can be used
for depth perception in both stereoscopic and non-stereoscopic display environments, and they do a very good job. Just look at any photograph; your
judgment of how far something was from the camera is likely to be quite
reasonable.
Nevertheless, a persons depth perception is considerably enhanced by
having two eyes. The computer vision techniques which we discuss in Chapter 8 show just how valuable having two independent views can be in extracting accurate depth information. If it were possible to take snapshots
of what one sees with the left eye and the right eye and overlay them, they
would be different. Parallax quantifies this difference by specifying numerically the displacement between equivalent points in the images seen from the
two viewpoints, such as the spires on the church in Figure 10.2.
Parallax can be quoted as a relative distance, also referred to as the parallax
separation. This is measured in the plane of projection at the display screen or
monitor, and may be quoted in units of pixels, centimeters etc. The value of
parallax separation is dependent on two factors. The first factor is the distance
between the viewed object and the plane of zero parallax, which is usually the
plane of projection. This effect is shown in Figure 10.3. The second factor
concerns the distance between the viewpoint and the display. The effect of
this is shown in Figure 10.4, where the parallax separation at a monitor screen
at a distance s1 from the viewpoint will be t1 , whilst the parallax separation
at a large projection screen a distance s2 from the viewpoint will be t2 . We
can avoid having to deal with this second influence on the value of parallax
by using an angular measure, namely the parallax angle, which is defined as θ. This is also shown in Figure 10.4. For best effect, a parallax (θ) of about 1.5° is acceptable. Expressing parallax as an angle makes it possible to create
stereoscopic images with parallax separations that are optimized for display
on a computer monitor or in a theater.
Figure 10.2. The images we see in our left and right eyes are slightly different, and
when overlapped they look like we have just had a bad night on the town.
Figure 10.3. Parallax effects: (a) Parallax t is measured in pixel units or inches or
centimeters; d is the eye separation, on average 2.5 in. (64 mm) in humans. (b) Zero
parallax. (c) Positive parallax. (d) Negative parallax. (e) Divergent parallax (does not
occur in nature).
Positive. The object appears to hover behind the projection plane; this
effect is rather like looking through a window.
Negative. Objects appear to float in front of the screen. This is the
most exciting type of parallax.
Divergent. Normal humans are incapable of divergent vision and therefore this form of parallax warrants no further discussion.
Normally the value of parallax separation (t) should be less than the separation between the two viewpoints (d). This will result in the viewer experiencing a comfortable stereoscopic effect.
10.1.1
Since parallax effects play such a crucial role in determining the experience
one gets from stereopsis, it seems intuitive that it should be adjustable in some
way at the geometric stages of the rendering algorithm (in CAD, animation
or VR software). That is, it has an effect in determining how to position the
two cameras used to render the scene. Since the required parallax concerns
viewpoints and projections, it will make its mark in the geometry stages of
rendering, as discussed in Section 6.6.6.
The most obvious geometrical model to construct would simulate what
happens in the real world by setting up two cameras to represent the left and
right eye viewpoints, both of which are focused on the same target. This target establishes the plane of zero parallax in the scene. Rendering is carried out
for the left and right views as they project onto planes in front of the left and
right cameras. If the target point is in front of the objects in the scene, they
will appear to be behind the screen, i.e., with positive parallax. If the target
is behind the objects, they will appear to stand out in front of the screen, i.e.,
negative parallax. Figure 10.5 illustrates the geometry and resulting parallax.
In this approach, we obtain the left view by adding an additional translation by −d/2 along the x-axis as the last step (before multiplying all the matrices together) in the algorithm of Section 6.6.6. For the right eye's view, the translation is by +d/2.
However, when using this form of set-up, the projection planes are not
parallel to the plane of zero parallax. When displaying the images formed at
these projection planes on the monitor (which is where we perceive the plane
of zero parallax to be), a parallax error will occur near the edges of the display.
Rather than trying to correct for this distortion by trying to transform the
Figure 10.5. The two viewpoints are separated by a distance d, and the views are
directed towards the same target. The origin of the coordinate system is midway
between the viewpoints, with the x-axis towards the right and the z-axis vertical, out of the plane of the diagram.
projection planes onto the plane of zero parallax, we use an alterative method
of projection based on using parallel aligned cameras.
Using cameras with parallel axis alignment is known as off-axis rendering,
and it gives a better stereoscopic view than the one depicted in Figure 10.5.
The geometry depicted in Figure 10.6 sets the scene for this discussion. Two
cameras straddling the origin of the coordinate system are aligned so that they
both are parallel with the usual direction of view. Unfortunately, the projection transformation is different from the transformation for a single viewpoint, because the direction of view vector no longer intersects the projection
plane at its center (see Figure 10.7). Implementing a projective transformation under these conditions will require the use of a nonsymmetric camera
frustum rather than a simple field of view and aspect ratio specification. The
geometry of this is not considered here, but real-time 3D-rendering software
libraries such as OpenGL offer functions to set up the appropriate projective
transformation.
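For example, here is a minimal sketch using OpenGL's fixed-function calls; the frustum offset and the ±d/2 eye translation are the only differences between the two eyes, while the parameter names and the way the zero-parallax distance is handled are our own illustrative choices.

#include <GL/gl.h>
#include <cmath>

// Set up an off-axis (parallel camera) projection for one eye.
// eye = -1 for the left eye, +1 for the right eye.
void setStereoProjection(int eye, double d,            // camera separation
                         double fovY, double aspect,   // vertical FOV (radians), w/h
                         double zNear, double zFar,
                         double zZeroParallax)         // distance of zero parallax
{
    double top    = zNear * std::tan(fovY / 2.0);
    double right  = top * aspect;
    // Shift the frustum horizontally so both view volumes meet at the
    // plane of zero parallax.
    double offset = eye * 0.5 * d * zNear / zZeroParallax;

    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glFrustum(-right - offset, right - offset, -top, top, zNear, zFar);

    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
    glTranslated(-eye * 0.5 * d, 0.0, 0.0);   // move the camera to this eye
}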
So far so good, but we have not said what the magnitude of d should be.
This really depends on the scale we are using to model our virtual environment. Going back to the point we originally made, we should always have a parallax separation of about ½ in. on the computer monitor and the center of the model at the point of zero parallax. In addition, the farthest point should have a negative parallax of about 1.5° and the nearest a parallax of about +1.5°.
Figure 10.6. Two cameras with their directions of view aligned in parallel, along the y-axis. A point halfway between them is placed at the origin of the coordinate system.
We need to determine a suitable separation d to satisfy a given parallax condition
which is comfortable for the viewer and gives a good stereoscopic effect at all scales.
Figure 10.7. Using off-axis (parallel) alignment results in a distorted projection which
is different for each of the viewpoints. To achieve this effect, the camera frustum must
be made to match the distorted viewing projection. The OpenGL and Direct3D
software libraries provide ways to do this.
To do this, we will have to ensure that the scene falls within a volume of
space bounded by known values: [xmin , xmax ] and [ymin , ymax ]. The vertical
extent of the scene (between [zmin , zmax ]) does not enter the calculation. We
will also need to define the required field of view φ. This is typically φ = 40° to 50°. It is useful to define Δx = xmax − xmin and Δy = ymax − ymin.
From the angular definition of parallax (see Figure 10.4), we see that for projection planes located at distances s1 and s2 from the viewpoint,
$$\theta = 2\arctan\frac{t_1}{2s_1} = 2\arctan\frac{t_2}{2s_2} = 2\arctan\frac{P}{2L}, \tag{10.1}$$
where P is the parallax separation measured at a plane a distance L from the viewpoint, so that
$$P = 2L\tan\frac{\theta}{2}.$$
In the conventional perspective projection of the scene, using a field of view φ and the property that the scene's horizontal expanse Δx just fills the viewport, the average distance L̄ of objects in the scene from the camera is
$$\bar{L} = \frac{\Delta x}{2\tan\frac{\phi}{2}}.$$
For a twin-camera set-up, the parallax separation at this distance is obtained by substituting L̄ for L in Equation (10.1) to give
$$P = 2\bar{L}\tan\frac{\theta}{2} = \frac{\tan\frac{\theta}{2}}{\tan\frac{\phi}{2}}\,\Delta x.$$
Requiring that the scene must have a maximum angular parallax θmax and a minimum angular parallax θmin, it is equivalent to say that
$$P_{max} = \frac{\tan\frac{\theta_{max}}{2}}{\tan\frac{\phi}{2}}\,\Delta x; \qquad P_{min} = \frac{\tan\frac{\theta_{min}}{2}}{\tan\frac{\phi}{2}}\,\Delta x.$$
Figure 10.8. Parallax geometry. The plane of zero parallax is at the front of the scene.
To determine d, write
$$\tan\frac{\theta_{max}}{2} = \frac{d/2}{\bar{L}+\Delta y} = \frac{P_{max}/2}{\Delta y};$$
then d = Pmax(1 + L̄/Δy).
(Note: Pmax and Pmin are measured in the units of the world coordinate system.)
In the case where we want the minimum angular parallax θmin = 0°, the distance of zero parallax is Dzpx = ymin = L̄. The equality with L̄ is true
because it is usually assumed that the plane of zero parallax lies at the average
distance between viewpoint and objects in the scene (see Figure 10.8).
To satisfy this constraint, the cameras must be separated by a distance d (in world coordinate units) given by
$$d = P_{max}\left(1 + \frac{\bar{L}}{\Delta y}\right). \tag{10.2}$$
Alternatively, to choose a specific plane of zero parallax defined by its distance from the viewpoint Dzpx, the camera separation d will be given by
$$d_1 = P_{min}\left(\frac{\bar{L}}{D_{zpx} - y_{min}} - 1\right); \qquad d_2 = P_{max}\left(\frac{\bar{L}}{y_{max} - D_{zpx}} + 1\right); \qquad d = \min(|d_1|, |d_2|).$$
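Putting the formulas above together, a small sketch (ours, following the relations as reconstructed here; names and units are illustrative) that returns a camera separation for a given scene extent and comfort limits:

#include <algorithm>
#include <cmath>

// Compute the camera separation d (world units) for a comfortable stereo view.
// thetaMax/thetaMin are the desired angular parallaxes (radians), fov is the
// horizontal field of view (radians); the scene occupies [yMin, yMax] in depth
// and has horizontal expanse dx; dZpx is the chosen plane of zero parallax.
double cameraSeparation(double thetaMax, double thetaMin, double fov,
                        double dx, double yMin, double yMax, double dZpx)
{
    double Lbar = dx / (2.0 * std::tan(fov / 2.0));        // average scene distance
    double Pmax = std::tan(thetaMax / 2.0) / std::tan(fov / 2.0) * dx;
    double Pmin = std::tan(thetaMin / 2.0) / std::tan(fov / 2.0) * dx;
    double d1 = Pmin * (Lbar / (dZpx - yMin) - 1.0);
    double d2 = Pmax * (Lbar / (yMax - dZpx) + 1.0);
    return std::min(std::fabs(d1), std::fabs(d2));
}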
10.2
Head-Mounted Displays
Figure 10.9. Optical and electronic see-through head mounted displays. The device depicted on the right of Figure 10.1 is an example of an electronic stereoscopic
HMD.
in the line of sight. Aircraft head-up displays (HUDs) have used this
technique for many years and the principle is also used in television
autocues and teleprompters. A problem that arises with this type of
HMD is that the image may appear ghostly or not solid enough.
2. Small CCD video cameras can be placed on the other side of the eyewear, and video from these can be digitized and mixed back into the
video signals sent to each eye. This type of HMD might be subjected
to an unacceptable delay between the video being acquired from the
cameras and displayed on the video monitors in front of each eye.
The cost of HMDs is falling, and for certain applications they are a very
attractive enabling technology. Individual processors could be assigned to
handle the stereoscopic rendering of a 3D VR world for each HMD. Position
and orientation information from a wide variety of motion-sensing technology can also be fed to the individual processors. The 3D world description
and overall system control is easily handled via a master process and network
connections. The stereoscopic display is actually the easiest part of the system
to deliver. Stereo-ready graphics adapters (Section 16.1.1) are not required
for HMD work; any adapter supporting dual-head (twin-video) outputs will
do the job. Movies, images or 3D views are simply rendered for left and right
eye views and presented on the separate outputs. Under the Windows oper-
10.3
The alternative to HMDs as a way of presenting different views to the left and
right eye is to present them on a monitor or projection screen in the usual way
but modify or tag them in some way so that a simple device (which the person
looking at the screen uses) can unmodify them just before they enter the eyes.
Essentially, this means wearing a pair of glasses, but unlike the HMDs, these
glasses can be as simple as a few bits of cardboard and transparent plastic1 .
10.3.1
Anaglyph Stereo
This is probably the simplest and cheapest method of achieving the stereoscopic effect. It was the first method to be used and achieved some popularity
in movie theaters during the first half of the twentieth century. A projector or
display device superimposes two images, one taken from a left eye view and
one from the right eye view. Each image is prefiltered to remove a different
color component. The colors filtered depend on the eyeglasses worn by the
viewer. These glasses tend to have a piece of red colored glass or plastic in
front of the left eye and a similar green or blue colored filter for the right side.
Thus, the image taken from the left eye view would have the green or blue
component filtered out of it so that it cannot be seen by the right eye, whilst
the image taken from the right eye viewpoint would have the red component
filtered from it so that it cannot be viewed by the left eye. The advantages of
this system are its very low cost and lack of any special display hardware. It
will work equally well on CRT and LCD monitors and on all video projectors, CRT, LCD or DLP. The disadvantage is that there is no perception of
true color in the pictures or movie. Figure 10.11 illustrates a typical system.
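A sketch (ours) of the image combination step: the red channel of the output carries the left-eye view and the green and blue channels carry the right-eye view, matching glasses that have the red filter on the left.

#include <vector>

struct RGB { unsigned char r, g, b; };

// Combine left- and right-eye renderings (same size) into a red-cyan anaglyph.
std::vector<RGB> makeAnaglyph(const std::vector<RGB>& left,
                              const std::vector<RGB>& right)
{
    std::vector<RGB> out(left.size());
    for (size_t i = 0; i < left.size(); ++i)
        out[i] = { left[i].r, right[i].g, right[i].b };
    return out;
}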
10.3.2
Active Stereo
This system can be used with a single monitor or projector. The viewer wears
a special pair of glasses that consist of two remotely controlled LCD shutters working synchronously with the projector or screen. The glasses may
be connected to the graphics adapter in the computer or they may pick up
1
This very simple idea can be dressed up in some impressive-sounding technical language:
the input signal (the pictures on the screen) is modulated onto a carrier using either time-division multiplexing (active stereo), frequency modulation (anaglyph stereo) or phase-shift modulation (passive stereo).
Figure 10.11. Anaglyph stereo, simple paper and plastic glasses with colored filters
can give surprisingly good results. This cheap technology is ideal for use with large
audiences.
10.3.3
Passive Stereo
A single CRT or DLP projector operates at double refresh rate and alternates
between showing images for left and right eye. A polarizing screen is placed in
front of the projector. The screen is synchronized with the graphics adapter to
alternate between two states to match the polarizing filters in a pair of glasses
worn by the viewer. For example, the glasses may have a horizontal polarizing
Figure 10.12. Active stereo using single CRT/DLP projectors. Top-quality and low-cost
eyewear systems.
glass in front of the right eye and a vertical polarizing filter in the glass in
front of the left eye. Then, when the polarizing screen is vertically polarized,
the light it lets through is also vertically polarized, and will pass through the
eyewear to the left eye but be completely blocked from being seen by the right
eye. At a high enough refresh rate, greater than 100 Hz, the viewer is unaware
of any of this switching.
Unlike active stereo, this system will work equally well with two projectors. One projector has a polarizing screen matching the left eye polarizing
filter and the other one's screen matches the right eye. Since there is no need
to switch rapidly between left and right images in this case, LCD projectors
can be used. CRT projectors can also be used; despite being slightly less
bright, they allow for much higher resolutions (greater than 1024 × 768).
One last comment on passive stereopsis: horizontal and vertical polarization cause a problem as the viewer tilts his head. This effect, known as
crosstalk, will increase until the tilt reaches 90°, at which point the views are
completely flipped. To avoid crosstalk, circular polarization is used, clockwise for one eye, anticlockwise for the other. For more information on the
theory of polarization relating to light and other electromagnetic radiation,
browse through Wangsness [4] or any book discussing the basics of electromagnetic theory. Circular polarizing filters require a tighter match between
glasses and polarizing screen and therefore may need to be obtained from the
same supplier. Figure 10.13 illustrates typical deployments for passive stereo
systems.
Figure 10.13. Passive stereo can be delivered with one or two projectors and a cheap pair of polarizing glasses.
10.3.4
Autostereopsis
Autostereopsis is the ultimate challenge for stereoscopic technology. Put bluntly, you throw away the eyewear and head-mounted displays, and the display
alone gives you the illusion of depth. One approach to autostereopsis requires
that the observer's head be tracked and the display be adjusted to show a
different view which reflects the observer's head movement. Two currently
successful concepts are illustrated in Figures 10.14 and 10.15, both of which
have been successfully demonstrated in practical field trials.
In the first approach, the goal is to make the left and right images appear
visible to only the left and right eyes respectively by obscuring the other image
when seen from a sweet spot in front of the screen. There are two ways in
which this can be achieved, but both ideas seek to divide the display into two
sets of interleaved vertical strips. In set 1 an image for the left eye is displayed,
Figure 10.14. Autostereoscopic displays may be built by adding either a special layer
of miniature lenses or a grating to the front of a traditional LCD monitor that causes
the left and right images to be visible at different locations in front of the monitor.
At a normal viewing distance, this corresponds to the separation of the eyes.
Figure 10.15. LightSpace Technologies' DepthCube [3] creates a volume into which a sequence of depth-layer images is projected.
whilst in set 2 an image for the right eye is displayed. The first approach uses
a lenticular glass layer that focuses the vertical strips at two points 30 cm or
so in front of the display and approximately at the positions of the left and
right eye respectively. The second method uses the grating to obscure those
strips that are not intended to be visible to the left eye and does the same for
the right eye.
A more user-friendly system developed by LightSpace Technologies [3]
builds a layered image by back-projecting a sequence of images into a stack of
LCD shutters that can be switched from a completely transparent state into
a state that mimics a back projection screen on which the image will appear
during a 1 ms time interval. The system uses 20 LCD layers and so can
display the whole sequence of depth images at a rate of 50 Hz. A patented
antialiasing technique is built into the hardware to blend the layer images
so that they don't look like discrete layers. A DLP projector is used for the
projection source.
10.3.5
Crosstalk
In the context of stereopsis, crosstalk is when some of the image that should
only be visible to the left eye is seen by the right eye, and vice versa. In active
stereo systems, it is primarily a function of the quality of the eyewear. There
are three possible sources of crosstalk.
1. Imperfect occlusion, where the polarizing filters fail to block out all
the light. Eyewear at the top end of the market is capable of delivering almost complete occlusion.
10.3.6
screens, there are no problems, but the challenge is to accommodate stereopsis under such conditions. Each individual will have a slightly different
viewpoint, point of focus etc., and these effects amplify the differences. Resolving and providing solutions to VR-specific issues such as this are still at
the research stage, and no definitive or obvious answer has emerged. Some
experiments have been done using shuttered eyewear systems that display two
images for every participant in the room. Unfortunately, this will involve
operating the display and polarizing shutters at much higher frequencies.
At present, systems that can accommodate three people have been demonstrated, and in one case even a revolving mechanical disk acting as a shutter
that sits in front of a projector has been tried. It would seem, however, that
the most promising approach involves trying to improve the LCD shutter
technology and projection systems. Fortunately, to a first approximation, the
effect on individuals of rendering a generic stereoscopic viewpoint from the
center of the cave does not introduce a distortion that is too noticeable. The
overwhelming sensation of depth perception outweighs any errors due to the
actual point of focus not being the one used to generate the stereoscopic view.
We will not take this discussion any further here, but we suggest that you
keep an eye on the research literature that appears at conferences such as the
SPIE [2] annual meeting on electronic imaging.
10.3.7
All the illustrations used so far have depicted front projection. There is no
particular reason for this. Back projection works equally well for stereopsis.
Sometimes the term used to describe back-projection stereopsis is a stereo wall.
If there are any drawbacks, they are purely in terms of space, but even this can
be overcome by using a mirror with the projectors mounted in a number of
different possible configurations. Back projection has the major advantage
that anyone working in the stereo environment can go right up to the screen
without blocking out the light from the projector.
10.4
Summary
This chapter has introduced the concept, notation and hardware of stereopsis. We have looked at both low-cost (e.g., stereo glasses) and high-cost (e.g.,
head-mounted displays) hardware and how they work to achieve the stereo
effect. So regardless of your VR budget, you should be able to develop a system that presents a convincing stereoscopic view of your virtual world.
Bibliography
[1] C. Bungert. HMD/Headset/VR-Helmet Comparison Chart. http://www.stereo3d.com/hmd.htm.
[2] IS&T and SPIE Electronic Imaging Conference Series. http://electronicimaging.org/.
11
Navigation and Movement in VR
So far, we have looked at how to specify 3D scenes and the shape and appearance of virtual objects. We've investigated the principles of how to render
realistic images as if we were looking at the scene from any direction and using
any type of camera. Rendering static scenes, even if they result in the most
realistic images one could imagine, is not enough for VR work. Nor is rendering active scenes with a lot of movement if they result from predefined actions
or the highly scripted behavior of camera, objects or environment. This is the
sort of behavior one might see in computer-generated movies. Of course, in
VR, we still expect to be able to move the viewpoint and objects around in
three dimensions, and so the same theory and implementation detail of such
things as interpolation, hierarchical movement, physical animation and path
following still apply.
What makes VR different from computer animation is that we have to be
able to put ourselves (and others) into the virtual scene. We have to appear
to interact with the virtual elements, move them around and get them to do
things at our command. To achieve this, our software must be flexible enough
to do everything one would expect of a traditional computer animation package, but spontaneously, not just by pre-arrangement. However, on its own,
software is not enough. VR requires a massive contribution from the human
interface hardware. To put ourselves in the scene, we have to tell the software
where we are and what we are doing. This is not easy. We have to synchronize the real with the virtual, and in the context of navigation and movement,
this requires motion tracking. Motion-tracking hardware will allow us to determine where we (and other real objects) are in relation to a real coordinate
frame of reference. The special hardware needed to do this has its own unique
set of practical complexities (as we discussed in Section 4.3). But once we can
acquire movement data in real time and feed it into the visualization software,
it is not a difficult task to match the real coordinate system with the virtual
one so that we can appear to touch, pick up and throw a virtual object. Or
more simply, by knowing where we are and in what direction we are looking,
a synthetic view of a virtual world can be fed to the display we are using, e.g.,
a head-mounted display.
In this chapter, we start by looking at those aspects of computer animation which are important for describing and directing movement in VR. Principally, we will look at inverse kinematics (IK), which is important when simulating the movement of articulated linkages, such as animal/human movement and robotic machinery (so different to look at, but morphologically so
similar). We will begin by quickly reviewing how to achieve smooth motion
and rotation.
11.1
Computer Animation
video frame. For one view it might take 10 ms, for another 50 ms. As we
indicated in Chapter 7, the time it takes to render a scene depends on the
scene complexity, roughly proportional to the number of vertices within the
scene.
As we shall see in Part II of the book, application programs designed
for real-time interactive work are usually written using multiple threads2 of
execution so that rendering, scene animation, motion tracking and haptic
feedback can all operate at optimal rates and with appropriate priorities. If
necessary, the rendering thread can skip frames. These issues are discussed in
Chapter 15.
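As a minimal illustration of that structure, the sketch below runs a tracking loop and a rendering loop in two separate threads. It uses C++ standard library threads for brevity, whereas the examples in Part II use the Windows threading API, and ReadSensorYaw() and RenderFrame() are placeholder stubs for the real tracker and renderer.

    #include <atomic>
    #include <chrono>
    #include <thread>

    double ReadSensorYaw() { return 0.0; }        // placeholder for the real tracker read
    void RenderFrame(double yaw) { (void)yaw; }   // placeholder for the real renderer

    std::atomic<bool> running(true);
    std::atomic<double> headYaw(0.0);             // pose data shared between the threads

    void TrackerThread() {                        // poll the motion sensor at its own rate
        while (running) {
            headYaw = ReadSensorYaw();
            std::this_thread::sleep_for(std::chrono::milliseconds(5));
        }
    }

    void RenderThread() {                         // render as fast as the scene allows
        while (running) {
            RenderFrame(headYaw);
        }
    }

    int main() {
        std::thread tracker(TrackerThread);
        std::thread renderer(RenderThread);
        std::this_thread::sleep_for(std::chrono::seconds(10));
        running = false;
        tracker.join();
        renderer.join();
        return 0;
    }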
11.2
Rigid Motion
Rigid motion is the simplest type of motion to simulate, because each object
is considered an immutable entity, and to have it move smoothly from one
place to another, one only has to specify its position and orientation. Rigid
motion applies to objects such as vehicles, airplanes, or even people who do
not move their arms or legs about. All fly-by and walkthrough-type camera
movements fall into this category.
A point in space is specified by a position vector, p = (x, y, z). To specify
orientation, three additional values are also required. There are a number of
possibilities available to us, but the scheme where values of heading, pitch and
roll (φ, θ, ρ) are given is a fairly intuitive description of a model's orientation
(these values are also referred to as the Euler angles). Once the six numbers (x, y, z, φ, θ, ρ) have been determined for each object or a viewpoint, a
transformation matrix can be calculated from them. The matrix is used to
place the object at the correct location relative to the global frame of reference
within which all the scene's objects are located.
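A sketch of that calculation is given below. The rotation order used here (roll, then pitch, then heading) is only one possible convention and must be matched to whatever convention the application's data assumes; the function name is ours.

    #include <cmath>

    // Build a 4 x 4 transformation from (x, y, z, heading, pitch, roll).
    // Rotation part: R = Rz(heading) * Ry(pitch) * Rx(roll); translation in the last column.
    void EulerToMatrix(double x, double y, double z,
                       double heading, double pitch, double roll,
                       double M[4][4]) {
        double ch = std::cos(heading), sh = std::sin(heading);
        double cp = std::cos(pitch),   sp = std::sin(pitch);
        double cr = std::cos(roll),    sr = std::sin(roll);
        M[0][0] = ch * cp;  M[0][1] = ch * sp * sr - sh * cr;  M[0][2] = ch * sp * cr + sh * sr;
        M[1][0] = sh * cp;  M[1][1] = sh * sp * sr + ch * cr;  M[1][2] = sh * sp * cr - ch * sr;
        M[2][0] = -sp;      M[2][1] = cp * sr;                 M[2][2] = cp * cr;
        M[0][3] = x;  M[1][3] = y;  M[2][3] = z;               // translation
        M[3][0] = 0;  M[3][1] = 0;  M[3][2] = 0;  M[3][3] = 1; // homogeneous row
    }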
11.2.1
Moving Smoothly
Finding the position of any object (camera, etc.) at any time instant t, as it
moves from one place to another, involves either mathematical interpolation
or extrapolation. Interpolation is more commonly utilized in computer animation, where we know the starting position of an object (x1 , y1 , z1 ) and its
2
Any program in which activities are not dependent can be designed so that each activity
is executed by its own separate and independent thread of execution. This means that if we
have multiple processors, each thread can be executed on a different processor.
orientation (φ1, θ1, ρ1) at time t1 and the final position (x2, y2, z2) and orientation (φ2, θ2, ρ2) at t2. We can then determine the position and orientation
of the object through interpolation at any time t where t1 ≤ t ≤ t2. Alternatively, if we need to simulate real-time motion then it is likely that we will
need to use two or more positions in order to predict or extrapolate the next
position of the object at a time in the future where t1 < t2 ≤ t.
Initially, let's look at position interpolation (angular interpolation is discussed separately in Section 11.2.2) in cases where the movement is along
a smooth curve. Obviously, the most appropriate way to determine any required intermediate locations, in between our known locations, is by using a
mathematical representation of a curved path in 3D space, also known as a
spline.
It is important to appreciate that for spline interpolation, the path must
be specified by at least four points. When a path is laid out by more than four
points, they are taken in groups of four. Splines have the big advantage that
they are well-behaved and they can have their flexibility adjusted by specified
parameters so that it is possible to have multiple splines or paths which go
through the same control points. Figure 11.1 illustrates the effect of increasing
and decreasing the tension in a spline.
The equation for any point p lying on a cubic spline segment, such as
that illustrated in Figure 11.2, is written in the form
p(\tau) = K_3\tau^3 + K_2\tau^2 + K_1\tau + K_0,    (11.1)
Figure 11.1. Changes in the flexibility (or tension) of a spline allow it to represent different paths through the same control points.
The constant vectors are determined by the conditions at the two ends of the segment; for the x component,

\begin{bmatrix} 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 \\ 0 & 0 & 1 & 0 \\ 3 & 2 & 1 & 0 \end{bmatrix}
\begin{bmatrix} K_{3x} \\ K_{2x} \\ K_{1x} \\ K_{0x} \end{bmatrix}
=
\begin{bmatrix} x_i \\ x_{i+1} \\ x'_i \\ x'_{i+1} \end{bmatrix},    (11.2)

from which

K_{3x} = 2x_i - 2x_{i+1} + x'_i + x'_{i+1},
K_{2x} = -3x_i + 3x_{i+1} - 2x'_i - x'_{i+1},
K_{1x} = x'_i,
K_{0x} = x_i.
However, as you can see, determination of the vector constants is dependent on finding the gradient of the spline (or its derivative) at the two control
points, Pi and Pi+1 . Remember that we can only define a spline using a minimum of four values. Thus we can use the knowledge we have of the other
two points (Pi−1 and Pi+2) on the spline in order to estimate the derivatives at Pi and Pi+1. We do this using finite differences, where

x'_i = \frac{x_{i+1} - x_{i-1}}{2} \quad \text{and} \quad x'_{i+1} = \frac{x_{i+2} - x_i}{2}.

That is, we are really finding the gradient of the spline at these two points.
This results in the following sequence of equations:

K_{3x} = -\tfrac{1}{2}x_{i-1} + \tfrac{3}{2}x_i - \tfrac{3}{2}x_{i+1} + \tfrac{1}{2}x_{i+2},    (11.3)
K_{2x} = x_{i-1} - \tfrac{5}{2}x_i + 2x_{i+1} - \tfrac{1}{2}x_{i+2},    (11.4)
K_{1x} = -\tfrac{1}{2}x_{i-1} + \tfrac{1}{2}x_{i+1},    (11.5)
K_{0x} = x_i.    (11.6)

Similar expressions give the y and z components, so that the vector constants are

K_c = \begin{bmatrix} K_{cx} \\ K_{cy} \\ K_{cz} \end{bmatrix}, \quad \text{for } c = 0, 1, 2, 3.
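The following sketch turns Equations (11.1) and (11.3)-(11.6) into code for a single coordinate; the same function is called once each for x, y and z to build the vectors K0 to K3. The structure and function names are ours. Evaluating with 0 ≤ τ ≤ 1 interpolates between Pi and Pi+1, while values of τ a little greater than 1 extrapolate beyond the last known sample, which is exactly the situation discussed below.

    struct SplineCoeffs { double K3, K2, K1, K0; };

    // Equations (11.3)-(11.6) for one coordinate, given the four control values.
    SplineCoeffs SplineFromControlPoints(double xm1, double xi, double xp1, double xp2) {
        SplineCoeffs c;
        c.K3 = -0.5 * xm1 + 1.5 * xi - 1.5 * xp1 + 0.5 * xp2;
        c.K2 =        xm1 - 2.5 * xi + 2.0 * xp1 - 0.5 * xp2;
        c.K1 = -0.5 * xm1             + 0.5 * xp1;
        c.K0 =                    xi;
        return c;
    }

    // Equation (11.1): 0 <= tau <= 1 interpolates between P(i) and P(i+1);
    // tau a little greater than 1 extrapolates beyond the last known sample.
    double SplineEvaluate(const SplineCoeffs& c, double tau) {
        return ((c.K3 * tau + c.K2) * tau + c.K1) * tau + c.K0;
    }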
At this stage, we need to draw a distinction about how we intend to use
our spline. When we are animating camera movement, for example, Pi−1 to
Pi+2 are all predetermined positions of the camera. That is, they are key positions. Then we can simply interpolate between these predetermined positions
in order to estimate how the camera moves so that the transition will appear
smooth, with no sudden changes of direction. We can do this by using Equation (11.1) to determine any position p at time t along the spline. Of course,
in order to use this equation, we need to insert a value for τ. Assuming that
an object following the spline path is getting its known locations (the Pi) at
equal time intervals Δt, we can parameterize the curve by time and obtain τ from

\tau = \frac{t - t_2}{\Delta t}.

This arises because the spline has been determined using four control
positions. In Figure 11.2, these control points are Pi−1, Pi, Pi+1 and Pi+2.
To determine the constant vectors, we set τ = 0 at Pi. If we wished to
interpolate between control points Pi+1 and Pi+2, we would use values of τ
in the range 1 < τ < 2.
Of course, as we mentioned earlier, spline interpolation is only useful
when we have predetermined key positions. Extrapolation is required when
there is no knowledge of future movement. Take for example our flight simulator, where we obtain information about position from external hardware
every second. However, suppose we must render our frames every hundredth
of a second. Assuming that Pi+2 is the last known position obtained from the
hardware, our software needs to extrapolate the position of our aircraft every
one hundredth of a second until the next position, Pi+3 , is available from the
hardware.
Again, we may use Equation (11.1) to determine the position p at any
time greater than t. Thus we try to predict how the spline will behave up
until our system is updated with the actual position. This may or may not
be the same as our extrapolated position. Typically, parameterization by time
and extrapolation can lead to a small error in predicted position. For example,
the point labeled Pactual in Figure 11.2 is the actual position of the next point
along the true path, but it lies slightly off the predicted curve.
So we need to recalculate the equation of the spline based on this current
actual position and its previous three actual positions. Thus, Pi becomes Pi1
and so on. Using the four new control positions, we recompute the spline,
and this new spline will then be used to predict the next position at a given
time slice. And so the procedure continues.
It is also worth noting at this stage that whilst spline extrapolation is not
without error, linear extrapolation is usually much more error-prone. For the
example in Figure 11.2, if linear extrapolation of the positions Pi+1 and Pi+2
is used to predict the new position at time t, the associated positional error is
much greater than with spline interpolation.
11.2.2
Rotating Smoothly
ciated with a rotation matrix using the slerp() function in the following three
steps:
1. Given an orientation that has been expressed in Euler angles at two
time points, l and m, calculate equivalent quaternions ql and qm, using
the algorithm given in Section A.2.1.
2. Interpolate a quaternion qk that expresses the orientation at time tk
using

\mu = \frac{t_k - t_l}{t_m - t_l}; \qquad
\theta = \cos^{-1}(q_l \cdot q_m); \qquad
q_k = \frac{\sin((1-\mu)\theta)}{\sin\theta}\, q_l + \frac{\sin(\mu\theta)}{\sin\theta}\, q_m.

3. Convert the interpolated quaternion qk back into an equivalent rotation
matrix so that it can be applied in the usual transformation pipeline.
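In code, step 2 looks like the sketch below; the Quaternion structure is our own minimal type, and the Euler-to-quaternion conversion of step 1 (Section A.2.1) is assumed to be available elsewhere.

    #include <cmath>

    struct Quaternion { double w, x, y, z; };

    // Spherical linear interpolation between ql and qm with 0 <= mu <= 1,
    // where mu = (tk - tl) / (tm - tl).
    Quaternion Slerp(const Quaternion& ql, const Quaternion& qm, double mu) {
        double dot = ql.w * qm.w + ql.x * qm.x + ql.y * qm.y + ql.z * qm.z;
        if (dot > 1.0) dot = 1.0;            // clamp to keep acos well defined
        if (dot < -1.0) dot = -1.0;
        double theta = std::acos(dot);
        if (std::fabs(std::sin(theta)) < 1e-6) return ql;   // orientations almost equal
        double a = std::sin((1.0 - mu) * theta) / std::sin(theta);
        double b = std::sin(mu * theta) / std::sin(theta);
        Quaternion qk;
        qk.w = a * ql.w + b * qm.w;
        qk.x = a * ql.x + b * qm.x;
        qk.y = a * ql.y + b * qm.y;
        qk.z = a * ql.z + b * qm.z;
        return qk;
    }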
11.3
Robotic Motion
Figure 11.3. A model with its skeleton: In (a), the model is shown; (b) shows the
skeleton (the thick lines). A pivot point is shown at the end of each bone. The thin
lines represent boxes that contain all the parts of the model attached to each bone.
In (c), a hierarchical representation of all the bones and how they are connected is
illustrated.
Figure 11.4. Pivoting the leg of the model shown in Figure 11.3 into two poses.
an example, a rotation of the front right upper leg around the hip joint moves
the whole front right leg (see Figure 11.4(a)). If this is followed by rotations
of the lower leg and foot, the pose illustrated in Figure 11.4(b) results.
11.3.1
Any skeleton has what we can term a rest pose, which is simply the form
in which it was created, before any manipulation is applied. The skeleton
illustrated in Figure 11.3 is in a rest pose. With the knowledge of hierarchical connectivity in a skeleton, a position vector giving the location of the end
(the node or the joint) of each bone uniquely specifies the skeleton's rest position. To set the skeleton into any other pose, that pose must satisfy the following
criteria:
1. Be obtained with a rotational transformation about an axis located at
one of the nodes of the skeleton.
2. If a transformation is deemed to apply to a specific bone, say i, then
it must also be applied to the descendant (child) bones as well.
For example, consider the simple skeleton illustrated in Figure 11.5. It
shows four bones; bones 2 and 3 are children to bone 1, and bone 4 is a child
to bone 3. Each bone is given a coordinate frame of reference (e.g., (x3 , y3 , z3 )
for bone 3). (For the purpose of this example, the skeleton will be assumed
to lie in the plane of the page.) P0 is a node with no associated bone; it acts
as the base of the skeleton and is referred to as the root.
Suppose that we wish to move bone 3. The only option is to pivot it
around a direction vector passing through node 1 (to which bone 3 is attached). A rotational transformation is defined as a 4 × 4 matrix, and we
can combine rotations around different axes to give a single matrix M that
encodes information for any sequence of rotations performed at a point. This
matrix takes the form shown in Equation (11.7), whilst the general theory
Figure 11.5. Specifying a rest pose of a skeleton with four bones 1 to 4. In addition
to the positions Pi of the nodes at the end of the bone, a local frame of reference
(xi , yi , zi ) is attached to each node. On the right, a hierarchical diagram of the skeleton is shown. P0 is a root node which has no bone attached and acts as the base of
the skeleton.
Figure 11.6. (a) Posing for the skeleton of Figure 11.5 by a rotation of bone 3 around
the node at the end of bone 1. (b) Co-positional nodes do not imply that two poses
are identical. The nodes here are at the same location as in (a), but the local frames
of reference do not take the same orientation.
Once M has been calculated, its application to P3 and P4 will move them
to appropriate locations for the new pose. The example pose in Figure 11.6(a)
was obtained by a rotation around the axis y2.
There are a few important observations that emerge from this simple
example:
Node 4 is affected by the transformation because it is descended from
node 3.
Nodes 0, 1 and 2 are unaffected and remain at locations p0 , p1 and
p2 . Importantly, the node about which the rotation is made is not
disturbed in any way.
The coordinate frames of reference attached to nodes 3 and 4 are also
transformed; they become (x′3, y′3, z′3) and (x′4, y′4, z′4).
Although node 4 (at p4 ) is moved, its position is unchanged in the local
frame of reference attached to node 3.
When M is applied to nodes 3 and 4, it is assumed that their coordinates are expressed in a frame of reference with origin at P1 .
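These observations translate into a very small amount of code. The sketch below uses our own illustrative Bone, Vec3 and Mat3 types (only the 3 × 3 rotation part of the transformation is shown for brevity); it applies a rotation defined at a pivot node to a chosen bone and every bone descended from it, leaving the rest of the skeleton untouched.

    #include <vector>

    struct Vec3 { double x, y, z; };
    struct Mat3 { double m[3][3]; };   // the 3 x 3 rotation part of the 4 x 4 matrix M

    // Multiply a point by the rotation matrix.
    Vec3 Rotate(const Mat3& M, const Vec3& p) {
        Vec3 r;
        r.x = M.m[0][0] * p.x + M.m[0][1] * p.y + M.m[0][2] * p.z;
        r.y = M.m[1][0] * p.x + M.m[1][1] * p.y + M.m[1][2] * p.z;
        r.z = M.m[2][0] * p.x + M.m[2][1] * p.y + M.m[2][2] * p.z;
        return r;
    }

    struct Bone {
        int parent;        // index of the parent bone, -1 for the root
        Vec3 position;     // location of the node at the end of the bone
    };

    // Apply a rotation M, defined about the node at 'pivot', to bone 'start'
    // and every bone descended from it; all other nodes are left untouched.
    void PoseBone(std::vector<Bone>& skeleton, int start, const Vec3& pivot, const Mat3& M) {
        for (int i = 0; i < (int)skeleton.size(); ++i) {
            int b = i;                                     // walk up the hierarchy ...
            while (b != -1 && b != start) b = skeleton[b].parent;
            if (b != start) continue;                      // ... bone i does not descend from 'start'
            Vec3 local;                                    // express the node relative to the pivot
            local.x = skeleton[i].position.x - pivot.x;
            local.y = skeleton[i].position.y - pivot.y;
            local.z = skeleton[i].position.z - pivot.z;
            Vec3 r = Rotate(M, local);                     // rotate, then translate back
            skeleton[i].position.x = r.x + pivot.x;
            skeleton[i].position.y = r.y + pivot.y;
            skeleton[i].position.z = r.z + pivot.z;
        }
    }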
It is possible to apply additional rotational transformations until any desired pose is achieved. To attain the pose depicted in Figure 11.4(b), several successive transformations from the rest position of Figure 11.3 were
necessary.
Note that even if the position of the nodes in two poses are identical,
it does not mean that the poses are identical. Look at Figure 11.6(b); this
shows four bones (four nodes) co-positional with the nodes in the pose of
Figure 11.6(a). Close inspection reveals that the local frames of reference for
the joint at P4 are very different. Therefore, the parts of the object attached
to P4 will appear different. For example, if P4 represents the joint angle at
the wrist of a virtual character then the character's hand will have a different
orientation between Figure 11.6(a) and Figure 11.6(b).
As we have already mentioned, if it is necessary to interpolate between
robot poses then quaternions offer the best option. Full details are given
in [1]. This is especially true in animation, where knowledge of the starting
and final position is available. However, for real-time interactive movement,
it is more appropriate to use inverse kinematics. This is another specialized
branch of mathematics that deals specifically with determining the movement
of interconnected segments of a body or material such that the motion is
transmitted in a predictable way through the parts.
11.4
Inverse Kinematics
11.4.1
The IK Problem
Suppose that the configuration of the articulated linkage is described by a set of n joint parameters gathered into the vector

\Theta = \begin{bmatrix} \theta_1 & \theta_2 & \dots & \theta_n \end{bmatrix}^T.

We can then further say that the location of the end of the articulation X is
given by some function,

X = f(\Theta).    (11.8)
Thus, if we know the orientation of each of the links connected to the effector, we can compute its position. This process is called forward kinematics.
However, in a typical environment, the operator of the articulated linkage is
really only concerned with setting the position of the effector and not with
deriving it from linkage orientation values. Thus we enter the realm of inverse kinematics. That is, knowing the position we require the effector X to
take, can we determine the required position and orientation of the adjoining
linkages to take the effector to this position and orientation?
To solve this problem, we must find the inverse of the function f () so that
we can write Equation (11.8) as

\Theta = f^{-1}(X).    (11.9)

It should be stressed that it may not be possible for the effector to reach the
desired position X, and hence we call it a goal.
Since X and Θ are both vectors, it may be possible to rewrite Equation (11.8) in matrix form, assuming the function f () is linear:

X = A\Theta,    (11.10)

in which case the problem is solved directly by inverting A:

\Theta = A^{-1}X.    (11.11)

In general, however, f () is not linear, so the problem has to be tackled by linearizing it about the current configuration and then incrementally stepping towards the solution. Since f () is a multivariable function, the Jacobian matrix of its partial derivatives is given by
J(\Theta) =
\begin{bmatrix}
\frac{\partial X_1}{\partial \theta_1} & \frac{\partial X_1}{\partial \theta_2} & \dots & \frac{\partial X_1}{\partial \theta_n} \\
\frac{\partial X_2}{\partial \theta_1} & \frac{\partial X_2}{\partial \theta_2} & \dots & \frac{\partial X_2}{\partial \theta_n} \\
\vdots & \vdots & & \vdots \\
\frac{\partial X_m}{\partial \theta_1} & \frac{\partial X_m}{\partial \theta_2} & \dots & \frac{\partial X_m}{\partial \theta_n}
\end{bmatrix}.

The goal is then approached iteratively using

\Theta_{n+1} = \Theta_n + J(\Theta_n)^{-1}\,\Delta X,
where n is a count of the number of iterations, and at each step J^{-1} is recalculated from the current Θn.
11.4.2
Figure 11.7. (a) Several possible orientations for a three-link two-dimensional articulated figure. In (1) and (2), the articulation reaches to the same endpoint. (b) A
three-link articulation.
link i is pointing. The first link angle θ1 is referenced to the x-axis. Using
this information, we can write an expression for the (x, y)-coordinate of P4:

x = l_1\cos\theta_1 + l_2\cos(\theta_1+\theta_2) + l_3\cos(\theta_1+\theta_2+\theta_3),
y = l_1\sin\theta_1 + l_2\sin(\theta_1+\theta_2) + l_3\sin(\theta_1+\theta_2+\theta_3).    (11.12)
The Jacobian for this three-link articulation is

J = \begin{bmatrix}
\frac{\partial x}{\partial \theta_1} & \frac{\partial x}{\partial \theta_2} & \frac{\partial x}{\partial \theta_3} \\
\frac{\partial y}{\partial \theta_1} & \frac{\partial y}{\partial \theta_2} & \frac{\partial y}{\partial \theta_3}
\end{bmatrix}
= \begin{bmatrix} J_{11} & J_{12} & J_{13} \\ J_{21} & J_{22} & J_{23} \end{bmatrix},

where the terms are obtained by differentiating Equation (11.12):

J_{11} = -l_1\sin\theta_1 - l_2\sin(\theta_1+\theta_2) - l_3\sin(\theta_1+\theta_2+\theta_3),
J_{12} = -l_2\sin(\theta_1+\theta_2) - l_3\sin(\theta_1+\theta_2+\theta_3),
J_{13} = -l_3\sin(\theta_1+\theta_2+\theta_3),
J_{21} = l_1\cos\theta_1 + l_2\cos(\theta_1+\theta_2) + l_3\cos(\theta_1+\theta_2+\theta_3),
J_{22} = l_2\cos(\theta_1+\theta_2) + l_3\cos(\theta_1+\theta_2+\theta_3),
J_{23} = l_3\cos(\theta_1+\theta_2+\theta_3).
Once J has been calculated, we are nearly ready to go through the iteration
process that moves P4 towards its goal. An algorithm for this will be based on
the equations

\Theta_{n+1} = \Theta_n + J(\Theta_n)^{-1}\,\Delta X;
X_{n+1} = f(\Theta_{n+1}).
For a general n-link articulation in two dimensions, the position of the end effector is given by

x = \sum_{i=1}^{n} l_i \cos\Bigl(\sum_{j=1}^{i}\theta_j\Bigr), \qquad
y = \sum_{i=1}^{n} l_i \sin\Bigl(\sum_{j=1}^{i}\theta_j\Bigr),

and the elements of the Jacobian follow as

\frac{\partial x}{\partial \theta_k} = -\sum_{i=k}^{n} l_i \sin\Bigl(\sum_{j=1}^{i}\theta_j\Bigr), \qquad
\frac{\partial y}{\partial \theta_k} = \sum_{i=k}^{n} l_i \cos\Bigl(\sum_{j=1}^{i}\theta_j\Bigr); \qquad k = 1, \dots, n.
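Putting these expressions together, the sketch below computes the forward kinematics and the Jacobian for the planar articulation and takes one iterative step of the kind described above, forming the generalized inverse of the 2 × n Jacobian as Jᵀ(JJᵀ)⁻¹. The function names and the choice n = 3 are ours.

    #include <cmath>

    const int N = 3;

    // Equation (11.12) generalized to N links: position of the end effector.
    void ForwardKinematics(const double l[N], const double theta[N], double& x, double& y) {
        x = y = 0.0;
        double sum = 0.0;
        for (int i = 0; i < N; ++i) {
            sum += theta[i];                 // cumulative angle theta_1 + ... + theta_i
            x += l[i] * std::cos(sum);
            y += l[i] * std::sin(sum);
        }
    }

    // The 2 x N Jacobian: dx/dtheta_k and dy/dtheta_k.
    void Jacobian(const double l[N], const double theta[N], double J[2][N]) {
        for (int k = 0; k < N; ++k) {
            double Jx = 0.0, Jy = 0.0, sum = 0.0;
            for (int i = 0; i < N; ++i) {
                sum += theta[i];
                if (i >= k) {                // only links k..N depend on theta_k
                    Jx -= l[i] * std::sin(sum);
                    Jy += l[i] * std::cos(sum);
                }
            }
            J[0][k] = Jx;
            J[1][k] = Jy;
        }
    }

    // Move the joint angles one step towards the goal (gx, gy).
    void IKStep(const double l[N], double theta[N], double gx, double gy) {
        double x, y;
        ForwardKinematics(l, theta, x, y);
        double dX[2] = { gx - x, gy - y };
        double J[2][N];
        Jacobian(l, theta, J);
        double a = 0, b = 0, c = 0, d = 0;   // A = J * J^T (a 2 x 2 matrix)
        for (int k = 0; k < N; ++k) {
            a += J[0][k] * J[0][k];  b += J[0][k] * J[1][k];
            c += J[1][k] * J[0][k];  d += J[1][k] * J[1][k];
        }
        double det = a * d - b * c;
        if (std::fabs(det) < 1e-9) return;   // near a singular configuration
        double inv[2][2] = { {  d / det, -b / det },
                             { -c / det,  a / det } };
        // dTheta = J^T * inv(J J^T) * dX
        double t0 = inv[0][0] * dX[0] + inv[0][1] * dX[1];
        double t1 = inv[1][0] * dX[0] + inv[1][1] * dX[1];
        for (int k = 0; k < N; ++k)
            theta[k] += J[0][k] * t0 + J[1][k] * t1;
    }

Calling IKStep() repeatedly, with the forward kinematics recomputed each time, is the incremental stepping towards the goal described above.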
11.4.3
All the steps in the solution procedure presented in Figure 11.8 are equally
applicable to the 3D IK problem. Calculation of generalized inverse, iteration
towards a goal for the end of the articulation and the termination criteria are
all the same. Where the 2D and 3D IK problems differ is in the specification
of the orientation of the links, the determination of the Jacobian and the
equations that tell us how to calculate the location of the joints between links.
Although the change from 2D to 3D involves an increase of only one
dimension, the complexity of the calculations increases to such an extent
that we cannot normally determine the coefficients of the Jacobian by differentiation of analytic expressions. Instead, one extends the idea of iteration
to write expressions which relate infinitesimal changes in linkage orientation to infinitesimal changes in the position of the end effector. By making
small changes and iterating towards a solution, it becomes possible to calculate the Jacobian coefficients under the assumptions that the order of rotational transformation does not matter, and that in the coefficients, sin
1
and cos
1. We will not consider the details here; they can be found
in [1]. Instead, we will move on to look at a very simple algorithm for IK in
3D.
Step 1:
Start with the last link in the chain (at the end effector).
Step 2:
Set a loop counter to zero. Set the max loop count to some upper limit.
Step 3:
Rotate this link so that it points towards the target. Choose
an axis of rotation that is perpendicular to the plane defined
by two of the linkages meeting at the point of rotation.
Do not violate any angle restrictions.
Step 4:
Move down the chain (away from the end effector) and go back to Step 3.
Step 5:
When the base is reached, determine whether the target has
been reached or the loop limit is exceeded. If so, exit;
if not, increment the loop counter, return to the last link and repeat from Step 3.
Figure 11.9. The cyclic coordinate descent iterative algorithm for manipulating an IK
chain into position. It works equally well in two or three dimensions.
Figure 11.10. The cyclic coordinate descent algorithm. Iteration round 1 (loop
count=1). The original position of the linkage is shown in light grey.
Figure 11.11. The cyclic coordinate descent algorithm. Iteration round 2 (loop
count=2) completes the cycle because the end effector had reached the target.
It may happen that the end effector cannot reach the goal, but this can be detected by the loop count limit being exceeded.
Typically, for real-time work, the loop count should never go beyond five.
The Jacobian method has the advantage of responding to pushes and pulls
on the end effector in a more natural way. The CCD method tends to move
the link closest to the end effector. This is not surprising when you look at
the algorithm, because if a rotation of the last linkage reaches the target, the
algorithm can stop. When using these methods interactively, there are noticeable differences in response; the CCD behaves more like a loosely connected
chain linkage whilst the Jacobian approach gives a configuration more like
one would imagine from a flexible elastic rod. In practical work, the CCD
approach is easy to implement and quite stable, and so is our preferred way
of using IK in real-time interactive applications for VR.
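The cyclic coordinate descent procedure of Figure 11.9 is equally compact in code. In the sketch below (our own function and variable names; joint angle limits and the 3D case are omitted) the joints are stored as absolute 2D positions and each link in turn is rotated to point at the target.

    #include <cmath>
    #include <vector>

    struct Pt { double x, y; };

    // joint[0] is the base, joint[n-1] is the end effector.
    void CCDSolve(std::vector<Pt>& joint, Pt target, double tolerance, int maxLoops) {
        int n = (int)joint.size();
        for (int loop = 0; loop < maxLoops; ++loop) {
            // Work from the link nearest the end effector back towards the base.
            for (int i = n - 2; i >= 0; --i) {
                Pt pivot = joint[i];
                Pt end = joint[n - 1];
                double aCur = std::atan2(end.y - pivot.y, end.x - pivot.x);
                double aTgt = std::atan2(target.y - pivot.y, target.x - pivot.x);
                double d = aTgt - aCur;          // rotate this link to point at the target
                double c = std::cos(d), s = std::sin(d);
                // Rotate every joint beyond the pivot about the pivot.
                for (int j = i + 1; j < n; ++j) {
                    double rx = joint[j].x - pivot.x, ry = joint[j].y - pivot.y;
                    joint[j].x = pivot.x + rx * c - ry * s;
                    joint[j].y = pivot.y + rx * s + ry * c;
                }
            }
            double dx = joint[n - 1].x - target.x, dy = joint[n - 1].y - target.y;
            if (std::sqrt(dx * dx + dy * dy) < tolerance) break;   // target reached
        }
    }

The simplicity of this loop is one reason the CCD approach is so attractive for real-time work.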
11.5
Summary
This chapter looked at the theory underlying two main aspects of movement
in VR, rigid motion and inverse kinematics (IK). Rigid motion is performed
by using some form of interpolation or extrapolation, either linear, quadratic,
or cubic in the form of a 3D parametric spline. Interpolation is also important
in obtaining smooth rotation and changes of direction. In this case, the most
popular approach is to use quaternion notation. That was explained together
with details of how to convert back and forth from the more intuitive Euler
angles and apply it in simulating hierarchical motion.
IK is fundamentally important in determining how robots, human avatars
and articulated linkages move when external forces are applied at a few of
their joints. We introduced two algorithms that are often used to express
IK behavior in VR systems, especially when they must operate in real time.
There is a close connection between the use of IK and the motion capture
and motion-tracking hardware discussed in Chapter 4. This is because IK
can predict the motion of parts of a linkage that have not been explicitly
tracked. It is especially useful in those cases where the tracked points relate to
human actions. For example, some dance steps might be captured by tracking
five markers on a human figure: two on the feet, two on the hands and the
other on the torso. A well-thought-out IK model will allow the motion of
ankles, knees, hips, wrists, elbows and shoulders to be simulated with a very
satisfying degree of realism.
And so, with all the theoretical detail explained with regards to creating
your own VR environment, it is now time to summarize all we have reviewed
before going on to the practical aspects of VR implementation.
Bibliography
[1] R. S. Ferguson. Practical Algorithms for 3D Computer Graphics. Natick, MA:
A K Peters, 2001.
[2] T. R. McCalla. Introduction to Numerical Methods and FORTRAN Programming. New York: John Wiley and Sons, 1967.
[3] C. Welman. Inverse Kinematics and Geometric Constraints for Articulated Figure
Manipulation. MSc Thesis, Simon Fraser University, 1993.
Summing Up
We have now reached the halfway point of the book. Back in Chapter 3, we
offered the opinion that VR has many useful applications with benefits for
mankind, but it also has its drawbacks, can be misused and can be applied in
situations where the original non-VR alternatives are still actually better. We
suggested that you (the reader) should make up your own mind whether VR is
appropriate or will work satisfactorily in an area that you may be considering
it for.
In the intervening chapters, we have explored the theory and technology
that delivers the most up-to-date virtual experiences. We have seen that a
VR system must interact with as many senses as possible to deliver a virtual
experience. To date, most research has been dedicated to delivering a realistic
visual experience. This is because sight is regarded as the most meaningful
human sense, and therefore is the sense that VR primarily tried to stimulate.
In addition, the process of sight is well understood, and most modern technology makes use of this knowledge, e.g., cameras and visual display devices.
Visualization technology is also very advanced because of its importance to
the entertainment industry (e.g., computer gaming). As a result, we can deliver highly realistic graphics in real time to the users of the VR environment.
Our ability to do this is also thanks to the huge revolution that graphics processors have undergone. Now we are able to program these directly, and as
such we can achieve much more realism in our graphic displays.
However, when we try to stimulate even more senses in VR, e.g., sound
and touch, the experience exceeds the sum of its parts. This is a very
important point and is one reason why our guide covers such a broad spectrum of ideas, from computer graphics to haptics. Indeed, we try to hint
that the term rendering, so familiar in graphics, is now being used to present
a 3D audio scene as well as painting haptic surfaces that we feel rather than
see. Sadly, this is difficult, and it may be a long time before it is possible
to render acoustically or haptically as realistic a scene as one can do
graphically.
Having described all the relevant technology involved in VR, we then
tried to give you an understanding of the internals of delivering the VR experience; that is, the software processing that must be undertaken to deliver
this experience. This is not a trivial topic, but if you want to build your own
virtual environment then you should be able to grasp the basics of this, e.g.,
graphics rendering, the Z-buffer, lighting etc.
To complete our first section, we also introduced the concept of computer
vision. For some time, the sciences of computer vision (with its applications
in artificial intelligence and robotics) and computer graphics have been considered separate disciplines, even to the extent of using different notations for
expressing exactly the same thing. But, for VR and the increasingly important augmented reality (AR), we need to bring these together and use both as if
they are one. For example, we can utilize computer vision to create VR from
pictures rather than build it graphically from mesh models. This is an ideal
way of creating the surrounding environment for certain VR experiences, e.g.,
walkthroughs.
Now, at the end of Part I, we hope that you have read enough on the
concepts of VR to help you answer whether VR is appropriate and suitable to
help enhance your own application.
does offer a benefit to mankind, one that should equal the communications
revolution of the Internet. In fact, VR is really just a part of it.
you some exciting and useful software. In many cases, the same software can
be used to drive low-cost bespoke/experimental VR technology (as illustrated
in Chapter 4), right up to the most lavishly endowed commercial systems.
What we can offer you are some pieces of the software jigsaw for VR, which
you can place together with some basic interface components to achieve your
own VR experience. This is where Part II comes in! So please read on.
Part II
Practical Programs for VR
12
Tools, Libraries and Templates for VR
This part of the book provides practical solutions and small projects for a
wide variety of VR requirements such as image and video processing, playing movies, interacting with external devices, stereoscopy and real-time 3D
graphics.
The ideas, algorithms and theoretical concepts that we have been looking
at in Part I are interesting, but to put them into practice, one must turn them
into computer programs. It is only in the memory of the computer that the
virtual world exists. It is through VR programs and their control over the
computer's hardware that the virtual (internal) world is built, destroyed and
made visible/touchable to us in the real world. This part of the book will
attempt to give you useful and adaptable code covering most of the essential
interfaces that allow us to experience the virtual world, which only exists as
numbers in computer files and memory buffers.
To prepare the computer programs, it is necessary to use a wide variety of
software tools. No one would think of trying to use a computer these days if
it didn't have an operating system running on it to provide basic functions.
In this chapter, we will start by looking at the development tools needed to
write the VR application programs. We will also discuss the strategy used in
the examples: its goal is to make the code as simple and adaptable as possible.
And we will look at the special problems and constraints posed by VR applications, especially how hardware and software must collaborate to deliver
the interfaces and speed of execution we require. VR applications, more of
12.1
In order to build and execute all the sample applications described in the next
few chapters, you will need to have a basic processor. Because Windows is
the dominant PC desktop environment, all our applications will be targeted
for that platform. A few of the examples can be compiled on other platforms;
when this is possible, we will indicate how. The most important hardware
component for creating a VR system is the graphics adapter. These are discussed in Section 12.1.1.
For the video applications, you will also need a couple of video sources. These
can be as low-tech as hobbyists' USB (universal serial bus) webcams, a digital camcorder with a FireWire (or IEEE 1394) interface or a TV tuner card.
Unfortunately, if you try to use two digital camcorders connected via the
IEEE 1394 bus then your computer is unlikely to recognize them both. Applications making use of DirectShow1 (as we will be doing in Chapter 15)
will only pick up one device. This is due to problems with the connection
1
DirectShow, DirectInput, Direct3D and DirectDraw are all part of Microsoft's comprehensive API library for developing real-time interactive graphical and video applications called
DirectX. Details of these components will be introduced and discussed in the next few chapters, but if you are unfamiliar with them, an overview of their scope can be found at [5].
12.1.1
Graphics Adapters
Different display adapter vendors often provide functions that are highly optimized for some specific tasks, for example hardware occlusion and rendering
with a high dynamic range. While this has advantages, it does not make applications very portable. Nevertheless, where a graphics card offers acceleration,
say to assist with Phong shading calculations, it should be used. Another interesting feature of graphics cards that could be of great utility in a practical
VR environment is support for multiple desktops or desktops that can span
two or more monitors. This has already been discussed in Chapter 4, but it
may have a bearing on the way in which the display software is written, since
two outputs might be configured to represent one viewport or two independently addressable viewports. One thing the graphics card really should be
able to support is programmability. This is discussed next.
12.1.2
GPUs
The graphics display adapter is no longer a passive interface between the processor and the viewer; it is a processor in its own right! Being a processor,
it can be programmed, and this opens up a fantastic world of opportunities, allowing us to make the virtual world more realistic, more detailed and
much more responsive. We will discuss this in Chapter 14. It's difficult to
quantify the outstanding performance the GPU gives you, but, for example,
NVIDIA's GeForce 6800 Ultra chipset can deliver a sustained 150 Gflops,
which is about 2,000 times more than the fastest Pentium processor. It is no
2
OpenGL is a standard API for real-time 3D rendering. It provides a consistent interface
to programs across all operating systems and takes advantage of any hardware acceleration
available.
wonder that there is much interest in using these processors for non-graphical
applications, particularly in the field of mathematics and linear algebra. There
are some interesting chapters on this subject in the GPU Gems book series,
volumes 1 and 2 [3, 9].
The architectural design of all GPUs conforms to a model in which there
are effectively two types of subprocessor, the vertex and fragment processors. The phenomenal processing speed is usually accomplished by having
multiple copies of each of these types of subprocessor: 8 or 16 are possible. Programming the vertex processor and fragment processor is discussed in
Chapter 14.
12.1.3
12.2
Software Tools
At the very least, we need a high-level language compiler. Many now come
with a friendly front end that makes it easier to manage large projects with
the minimum of fuss. (The days of the traditional makefile are receding fast.)
Unfortunately, this is only half the story. For most applications in VR, the
majority of the code will involve using one of the system API libraries. In
the case of PCs running Windows, this is the platform software developers
kit (SDK). For real-time 3D graphics work, a specialist library is required.
Graphics APIs are introduced and discussed in Chapter 13, where we will
see that one of the most useful libraries, OpenGL, is wonderfully platformindependent, and it can often be quite a trivial job to port an application between Windows, Linux, Mac and SGI boxes. We shall also see in Chapter 17
that libraries of what is termed middleware can be invaluable in building VR
applications.
12.2.1
Almost without exception, VR and 3D graphics application programs are developed in either the C and/or C++ languages. There are many reasons for
this: speed of program execution, a vast library of legacy code, the quality of developer tools; indeed, all of the operating systems on the target platforms were created
using C and C++. So we decided to follow the majority and write our code
in C and C++. We do this primarily because all the appropriate real-time 3D
graphics and I/O libraries are used most comfortably from C or C++. Since
OpenGL has the longest history as a 3D graphics library, it is based on the
functional approach to programming, and therefore it meshes very easily with
applications written in C. It does, of course, work just as well with application
programs written in C++. The more recently specified API for Windows application programs requiring easy and fast access to graphics cards, multimedia and other I/O devices, known as DirectX, uses a programming interface
that conforms to the component object model (COM). COM is intended to
be usable from any language, but in reality most of the examples and application programs in the VR arena use C++. Sadly, programming with COM can
be a nightmare to do correctly, and so Microsofts Visual Studio development
tools include a library of advanced C++ template classes to help. Called the
ATL, it is an enormous help to application developers, who sometimes get
interface reference counting object creation and destruction wrong. We talk
more about this topic in Section 12.4.2.
12.2.2
Compilers
12.3
The readme.txt file on the accompanying CD will give more details of the latest versions
of the compilers.
4
Macs based on Intel hardware are now available, so perhaps OS X will enjoy the benefits
of a vast selection of different graphics adapters after all.
12.4
If you havent worked with COM before, it can be quite a shock to be confronted by some of its peculiarities. A lot of the difficulties are due to the
sudden onslaught of a vast collection of new jargon and terminology. However, once you catch on to the key concepts and find, or are shown, some
of the key ideas, it becomes much easier to navigate through the bewildering
and vast reference documentation around COM. So, before jumping into the
examples, we thought it important to highlight the concepts of COM programming and pull out a few key constructs and ideas so that you will see
the structure. Hopefully, you will appreciate that it is not as overly complex
as it might appear at first sight. Our discussion here will be limited to those
features of COM needed for the parts of the DirectX system we intend to use.
12.4.1
What is COM?
COM grew out of the software technologies of object linking and embedding
(OLE) and ActiveX [2]. It can be summed up in one sentence:
COM is a specification for creating easily usable, distributable
and upgradable software components (library routines by another name) that are independent of the high-level language in
which they were created, i.e., it is architecture-neutral.
The jargon of COM is well-thought-out but may be a little confusing at first.
Whilst it is supposed to be language-independent, it is best explained in the
context of the C++ language in which the majority of COM components are
implemented anyway. Any COM functionality that one wishes to provide or
use is delivered by a set of objects. These objects are instances of a class (called
a CoClass) that describes them. The objects are not created, nor used, in the
conventional C++ sense, i.e., by instantiating them directly with commands
such as new MyComObject1;. Instead, each COM class provides a series
of interfaces through which all external communication takes place. Every
COM interface provides a number of methods (we would call them functions)
which the application program can use to manipulate the COM object and
use its functionality. Thus, the key terms when talking about COM are:
COM object. An instance of a CoClass.
CoClass. The class describing the COM object.
Figure 12.1. A typical COM object has its unique class identifier stored in the Win-
dows registry.
12.4.2
interface may be freed, and when all the interfaces supported by a COM
object are released, the memory used by the COM object itself may be freed.
But herein lies a major problem! Get the reference counting wrong and your
program could try to use an object that no longer exists, or it could go on
creating new instances of the object. Not even exiting the program helps;
COM objects exist outside specific applications. Now, you may be a good
programmer who is very careful about freeing up the memory that you use,
but it is not always easy to tell with the DirectX library how many COM
objects are created when you call for a particular interface.
You may be asking: what has this got to do with me? The answer
is simple: you will be developing programs to make your virtual
world a reality. Those programs will want to run in real time for
a long time (days or weeks of continuous execution). You need
to use COM if you want to use any component of DirectX (and
you will), so if you get reference counting wrong, your program
will either crash the computer or cause it to grind to a halt!
Luckily, a recent development called the C++ Active Template Library (ATL)
has a special mechanism for automatically releasing COM objects when a
program no longer needs them. We shall employ this strategy in many of our
example programs. Of course, this adds yet another, and visually different,
way of writing COM code to the list, confusing the programmer even more.
To read more about COM, and there is a lot more to be read, the book by
Templeman [13] shows how COM fits in with the MFC, and Grimes et al. [6]
cover using COM in the context of the ATL.
Additional details of how to use the COM and a template for writing
COM programs is given in Appendix F.
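To give a flavor of how the ATL helps, the fragment below wraps two DirectShow interfaces in CComPtr smart pointers; when the pointers go out of scope, Release() is called automatically, so the reference counts cannot be left dangling. The surrounding function is our own illustrative example rather than one of the book's project programs.

    #include <dshow.h>
    #include <atlbase.h>
    #pragma comment(lib, "strmiids.lib")

    void BuildGraphExample() {
        CoInitialize(NULL);                       // bring up the COM library
        {
            CComPtr<IGraphBuilder> pGraph;        // smart pointer wrapping the interface
            HRESULT hr = pGraph.CoCreateInstance(CLSID_FilterGraph);
            if (SUCCEEDED(hr)) {
                CComPtr<IMediaControl> pControl;
                // QueryInterface for a second interface on the same COM object.
                hr = pGraph.QueryInterface(&pControl);
                // ... use pGraph and pControl here ...
            }
            // Both interfaces are Release()d automatically as the CComPtrs go out
            // of scope, so the reference counts are kept correct without any effort.
        }
        CoUninitialize();
    }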
12.5
Summary
In this chapter, we have raced through a wide range of topics, topics that could
easily occupy several volumes on their own. We have tried to keep the topics
as brief as possible and referred to other sources where each of them is explored
in much greater detail. We hope that you have gained an appreciation of the
way the application programs in the rest of the book will be structured. It
is not essential to have a comprehensive understanding of the detail here in
order to appreciate the practical examples we are now turning to, but we hope
this will make it easier to find your way through the code and thus focus on
the more important parts, where the basic frameworks metamorphose into
the specific applications.
You may have noticed that the Direct3D version mentioned here is Version 9. It may be that by the time you are reading this, it will be Version
10, 11, or even 97. The good news is that, because of COM, everything you read
here should still work exactly as described. We have also noticed that as each
Direct3D version changes, the differences become smaller and tend to focus
on advanced features, such as the change from Shader Model 2 to Shader
Model 3 or the changes that allow programs to take advantage of the new features of the GPU. Thus, even a new Direct3D version will almost certainly
use initialization code very similar to that given here.
In the remainder of the book, we will get down to some specific examples
of programs in 3D graphics, multimedia and custom interaction for VR.
Bibliography
[1] Apple Computer, Inc. Learning Carbon. Sebastopol, CA: O'Reilly & Associates, 2001.
[2] D. Chappell. Understanding ActiveX and OLE. Redmond, WA: Microsoft Press,
1996.
[3] R. Fernando (editor). GPU Gems. Boston, MA: Addison-Wesley Professional,
2004.
[4] Free Software Foundation, Inc. GCC, the GNU Compiler Collection. http://gcc.gnu.org/.
[5] K. Gray. Microsoft DirectX 9 Programmable Graphics Pipeline. Redmond, WA:
Microsoft Press, 2003.
[6] R. Grimes et al. Beginning ATL COM Programming. Birmingham, UK: Wrox
Press, 1998.
[7] Microsoft Corporation. Visual C++ 2005 Express Edition. http://msdn.microsoft.com/vstudio/express/visualC/.
[8] M. Pesce. Programming Microsoft DirectShow for Digital Video and Television.
Redmond, WA: Microsoft Press, 2003.
[9] M. Pharr (editor), GPU Gems 2. Boston, MA: Addison-Wesley Professional,
2005.
[10] B. E. Rector and J. M. Newcomer. Win32 Programming. Reading, MA:
Addison-Wesley Professional, 1997.
13
Programming 3D Graphics in Real Time
Three-dimensional graphics have come a long way in the last 10 years or so.
Computer games look fantastically realistic on even an off-the-shelf home
PC, and there is no reason why an interactive virtual environment cannot do
so, too. The interactivity achievable in any computer game is exactly what is
required for a virtual environment even if it will be in a larger scale. Using
the hardware acceleration of some of the graphics display devices mentioned
in Chapter 12, this is easy to achieve. We will postpone examining the programmability of the display adapters until Chapter 14 and look now at how
to use their fixed functionality, as it is called.
When working in C/C++, the two choices open to application programs
for accessing the powerful 3D hardware are to use either the OpenGL or Direct3D1 API libraries. To see how these are used, we shall look at an example
program which renders a mesh model describing a virtual world. Our program will give its user control over his navigation around the model-world and
the direction in which he is looking. We will assume that the mesh is made up
from triangular polygons, but in trying not to get too bogged down by all the
code needed for reading a mesh format, we shall assume that we have library
functions to read the mesh from file and store it as lists of vertices, polygons
and surface attributes in the computer's random access memory (RAM).2
13.1
OpenGL had a long and successful track record before appearing on PCs for
the first time with the release of Windows NT Version 3.51 in the mid 1990s.
OpenGL was the brainchild of Silicon Graphics, Inc. (SGI), who were known
as the real-time 3D graphics specialists with applications and their special
workstation hardware. Compared to PCs, the SGI systems were just unaffordable. Today they are a rarity, although still around in some research labs
and specialist applications. However, the same cannot be said of OpenGL; it
goes from strength to strength. The massive leap in processing power, especially in graphics processing power in PCs, means that OpenGL can deliver
almost anything that is asked of it. Many games use it as the basis for their
rendering engines, and it is ideal for the visualization aspects of VR. It has, in
our opinion, one major advantage: it is stable. Only minor extensions have
been added, in terms of its API, and they all fit in very well with what was
already there. The alternative 3D software technology, Direct3D, has required
nine versions to match OpenGL. For a short time, Direct3D seemed to be
ahead because of its advanced shader model, but now with the addition of
just a few functions in Version 2, OpenGL is its equal. And, we believe, it is
much easier to work with.
13.1.1
It is worth pausing for a moment or two to think about one of the key aspects
of OpenGL (and Direct3D too). They are designed for real-time 3D graphics.
VR application developers must keep this fact in the forefront of their minds
at all times. One must always be thinking about how long it is going to take
to perform some particular task. If your virtual world is described by more
than 1 million polygons, are you going to be able to render it at 100 frames
per second (fps) or even at 25 fps? There is a related question: how does one
do real-time rendering under a multitasking operating system? This is not a
trivial question, so we shall return to it in Section 13.1.6.
To find an answer to these questions and successfully craft our own programs, it helps if one has an understanding of how OpenGL takes advantage
of 3D hardware acceleration and what components that hardware acceleration
actually consists of. Then we can organize our programs data and its control
logic to optimize its performance in real time. Look back at Section 5.1 to
review how VR worlds are stored numerically. At a macroscopic level, you
can think of OpenGL simply as a friendly and helpful interface between your
VR application program and the display hardware.
Since the hardware is designed to accelerate 3D graphics, the elements
which are going to be involved are:
Bitmaps. These are blocks of memory storing images, movie frames
or textures, organized as an uncompressed 2D array of pixels. It is
normally assumed that each pixel is represented by 3 or 4 bytes, the
dimensions of the image may sometimes be a power of two and/or the
pitch3 may be word- or quad-word-aligned. (This makes copying images faster and may be needed for some forms of texture interpolation,
as in image mapping.)
Vertices. Vertex data is one of the main inputs. It is essentially a
floating-point vector with two, three or four elements. Whilst image data may only need to be loaded to the display memory infrequently, vertex data may need to change every time the frame is rendered. Therefore, moving vertex data from host computer to display adapter is one of the main sources of delay in real-time 3D. In
OpenGL, the vertex is the only structural data element; polygons are
implied from the vertex data stream.
Polygons. Polygons are implied from the vertex data. They are rasterized4 internally and broken up into fragments. Fragments are the basic
visible unit; they are subject to lighting and shading models and image
3
The pitch is the number of bytes of memory occupied by a row in the image. It may or
may not be the same as the number of pixels in a row of the image.
4
Rasterization is the process of finding what screen pixels are filled with a particular polygon after all necessary transformations are applied to its vertices and it has been clipped.
mapping. The final color of the visible fragments is what we see in the
output frame buffer. For most situations, a fragment is what is visible
in a pixel of the final rendered image in the frame buffer. Essentially,
one fragment equals one screen pixel.
Like any microcomputer, a display adapter has three main components
(Figure 13.1):
I/O. The adapter receives its input via one of the system buses. This
could be the AGP or the newer PCI Express interface5. From the program's viewpoint, the data transfers are blocks of byte memory (images
and textures) or floating-point vectors (vertices or blocks of vertices),
plus a few other state variables. The outputs come from the adapter's
video memory and are transferred to the display device through
either the DVI interface or an A-to-D circuit driving VGA. The most significant bottleneck in rendering big VR environments is often on the
input side, where the adapter meets the PC.
Figure 13.1. The main components in a graphics adapter: the parallel, pipelined vertex and fragment processors and the video memory that holds the textures and
frame buffers.
5. PCI Express allows much faster read-back from the GPU to the CPU and main memory.
RAM. More precisely, video RAM or vRAM. This stores the results of the
rendering process in a frame buffer. Usually two or more buffers are
available. Buffers do not just store color pixel data. The Z depth buffer
is a good example of an essential buffer with its role in the hidden
surface elimination algorithm. Others such as the stencil buffer have
high utility when rendering shadows and lighting effects. vRAM may
also be used for storing vertex vectors and particularly image/texture
maps. The importance of texture map storage cannot be overemphasized. Realistic graphics and many other innovative ideas, such as doing
high-speed mathematics, rely on big texture arrays.
Processor. Before the processor inherited greater significance by becoming externally programmable (as discussed in Chapter 14), its primary
function was to carry out floating-point vector arithmetic on the incoming vertex data, e.g., multiplying the vertex coordinates by one
or more matrices. Since vertex calculations (and also fragment calculations) are independent of other vertices/fragments, it is possible to
build several processors into one chip and achieve a degree of parallelism. When one recognizes that there is a sequential element in the
rendering algorithm (vertex data transformation, clipping, rasterization, lighting etc.), it is also possible to endow the adapter chipset with
a pipeline structure and reap the benefits that pipelining has brought to computer processor architecture (but without many of the drawbacks, such
as conditional branching and data hazards). Hence the term often used
to describe the process of rendering in real time: passing through the
graphics pipeline.
So, OpenGL drives the graphics hardware. It does this by acting as a state
machine. An application program uses functions from the OpenGL library to
set the machine into an appropriate state. Then it uses other OpenGL library
functions to feed vertex and bitmap data into the graphics pipeline. It can
pause, send data to set the state machine into some other state (e.g., change
the color of the vertex or draw lines rather than polygons) and then continue
sending data. Application programs have a very richly featured library of
functions with which to program the OpenGL state machine; these are very
well documented in [6]. The official guide [6] does not cover programming
with Windows very comprehensively; for using OpenGL with Windows,
tutorial material is available in books such as [7].
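As a minimal illustration of this set-the-state-then-send-data pattern (a sketch of our own, not one of the book's listings), the fragment below changes the current color state twice while feeding triangle vertices into the pipeline; every vertex issued after a glColor3f call inherits that color until the state is changed again.
glBegin(GL_TRIANGLES);            // start sending triangle vertices
  glColor3f(1.0f, 0.0f, 0.0f);    // set the current color state to red
  glVertex3f(-1.0f, 0.0f, 0.0f);  // these three vertices form one
  glVertex3f( 1.0f, 0.0f, 0.0f);  // red triangle
  glVertex3f( 0.0f, 1.0f, 0.0f);
  glColor3f(0.0f, 0.0f, 1.0f);    // change the state: subsequent
  glVertex3f(-1.0f, 0.0f, -1.0f); // vertices are blue
  glVertex3f( 1.0f, 0.0f, -1.0f);
  glVertex3f( 0.0f, 1.0f, -1.0f);
glEnd();                          // stop sending vertex data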
Figure 13.2 illustrates how the two main data streams, polygon geometry
and pixel image data, pass through the graphics hardware and interact with
Figure 13.2. The OpenGL processing pipeline: geometric 3D data in the form of vertex data and image data in the form of pixels pass through a series of steps until they
appear as the output rendered image in the frame buffer at the end of the pipeline.
each other. It shows that the output from the frame buffers can be fed back
to the host computer's main memory if desired. There are two key places
where the 3D geometric data and pixel image interact: in the rasterizer they
are mixed together so that you could, for example, play a movie in an inset
area of the window or make a Windows window appear to float around with
the geometric elements. Pixel data can also be stored in the hardware's video
RAM and called on by the fragment processor to texture the polygons. In
OpenGL Version 1.x (and what is called OpenGL's fixed functionality), shading and lighting are done on a per-vertex basis using the Gouraud model.
Texture coordinates and transformations are also calculated at this stage for
each primitive. A primitive is OpenGL's term for the group of vertices that,
taken together, represent a polygon in the 3D model. It could be three, four
or a larger group of vertices forming a triangle strip or triangle fan (as detailed
in Section 5.1). In fixed-functionality mode, the fragment processing step is
fairly basic and relates mainly to the application of a texture. We shall see
in Chapter 14 that with OpenGL Version 2, the fragment processing step
becomes much more powerful, allowing lighting calculation to be done on a
per-fragment basis. This makes Phong shading a possibility in real time. The
full details of the processing which the geometry undergoes before rasterization are given in Figure 13.3. It is worth also pointing out that the frame buffer
is not as simple as it appears. In fact, it contains several buffers (Z depth, stencil etc.), and various logical operations and bit blits6 can be performed within
and among them.
The two blocks transform & light and clip in Figure 13.2 contain quite a bit of detail; this is expanded on in Figure 13.3.
As well as simply passing vertex coordinates to the rendering
hardware, an application program has to send color information, texture coordinates and vertex normals so that the lighting
and texturing can be carried out. OpenGL's implementation as a
state machine means that there are current values for color, surface
normal and texture coordinate. These are only changed if the
6. Bit blit is the term given to the operation of copying blocks of pixels from one part of the
video memory to another.
application sends new data. Since the shape of the surface polygons is implied by a given number of vertices, these polygons
must be assembled. If a polygon has more than three vertices,
it must be broken up into primitives. This is primitive assembly. Once the primitives have been assembled, the coordinates
of their vertices are transformed to the camera's coordinate system (the viewpoint is at (0, 0, 0); a right-handed coordinate system
is used with +x to the right, +y up and −z pointing away from the
viewer in the direction the camera is looking) and then clipped to
the viewing volume before being rasterized.
To summarize the key point of this discussion: as application developers,
we should think of vertices first, particularly how we order their presentation
to the rendering pipeline. Then we should consider how we use the numerous frame buffers and how to load and efficiently use and reuse images and
textures without reloading. The importance of using texture memory efficiently
cannot be overemphasized.
In Figure 13.4, we see the relationship between the software components
that make up an OpenGL system. The Windows DLL (and its stub library
file) provides an interface between the application program and the adapter's device driver.
Figure 13.4. A program source uses a stub library to acquire access to the OpenGL
API functions. The executable calls the OpenGL functions in the system DLL, and
this in turn uses the drivers provided by the hardware manufacturer to implement
the OpenGL functionality.
13.1.2
Drawing directly to the screen like this can give an unsatisfactory result.
Each time we draw, we start by clearing the screen and then render one polygon after another until the picture is complete. This will give the screen the
appearance of flickering, and you will see the polygons appearing one by one.
This is not what we want; a 3D solid object does not materialize bit by bit. To
solve the problem, we use the technique of double buffering. Two copies of
the frame buffer exist; one is visible (appears on the monitor) and one is hidden. When we render with OpenGL, we render into the hidden buffer, and
once the drawing is complete, we swap them over and start drawing again
in the hidden buffer. The swap can take place imperceptibly to the viewer
because it is just a matter of swapping a pointer during the display's vertical
blanking interval. All graphics hardware is capable of this.
Under Windows, there is a very small subset of platform-specific OpenGL
functions whose names begin with wgl. These allow us to make an implicit
connection between the Windows window handle hWnd and OpenGL's drawing window.
Listing 13.1. To use OpenGL in Windows, it is necessary to handle four messages: one
to initialize the OpenGL drawing system (WM_CREATE), one to perform all the necessary rendering (WM_PAINT), one to release OpenGL (WM_DESTROY) and WM_SIZE
(not shown) to tell OpenGL to change its viewport and aspect ratio.
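The body of Listing 13.1 is not reproduced in the text, but a minimal sketch of how such a message handler is commonly structured is given below. The helper SetWindowPixelFormat(..) stands in for the pixel-format code of Listing 13.2, and DrawScene(), hDC and hRC are assumed names, not the book's.
case WM_CREATE:                              // initialize OpenGL
  hDC = GetDC(hWnd);                         // get a device context
  SetWindowPixelFormat(hDC);                 // see Listing 13.2
  hRC = wglCreateContext(hDC);               // make an OpenGL rendering
  wglMakeCurrent(hDC, hRC);                  // context and select it
  return 0;
case WM_SIZE:                                // window changed shape
  glViewport(0, 0, LOWORD(lParam), HIWORD(lParam));
  return 0;
case WM_PAINT:                               // do all the rendering
  DrawScene();                               // (assumed application function)
  SwapBuffers(hDC);                          // show the hidden buffer
  ValidateRect(hWnd, NULL);
  return 0;
case WM_DESTROY:                             // release OpenGL
  wglMakeCurrent(NULL, NULL);
  wglDeleteContext(hRC);
  ReleaseDC(hWnd, hDC);
  PostQuitMessage(0);
  return 0;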
Listing 13.2. A pixel format tells OpenGL whether it is to render using double buffering, among other attributes of the output window.
Listing 13.2 shows the key function used to give the window certain
attributes, such as double buffering and stereoscopic display.
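The body of Listing 13.2 is likewise omitted here. Under the usual Win32 conventions, setting such a pixel format looks something like the sketch below (the function name matches the assumed helper used above):
void SetWindowPixelFormat(HDC hDC){
  PIXELFORMATDESCRIPTOR pfd;
  int format;
  ZeroMemory(&pfd, sizeof(pfd));
  pfd.nSize      = sizeof(pfd);
  pfd.nVersion   = 1;
  pfd.dwFlags    = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL
                 | PFD_DOUBLEBUFFER;         // add PFD_STEREO here for a
                                             // stereoscopic display
  pfd.iPixelType = PFD_TYPE_RGBA;            // RGBA color buffer
  pfd.cColorBits = 24;                       // 24-bit color
  pfd.cDepthBits = 16;                       // 16-bit Z depth buffer
  format = ChoosePixelFormat(hDC, &pfd);     // find the closest match the
  SetPixelFormat(hDC, format, &pfd);         // driver offers and select it
}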
13.1.3
There are a vast number of ways one could write our visualization program.
Design questions might include how the user navigates through the scene,
whether it is a windowed or full-screen program etc. Our desire is to try to
keep the code as focused as possible on the 3D programming issues. Therefore, we won't include a fancy toolbar or a large collection of user dialog
boxes; you can add these if desired. We will just give our program a simple
menu and a few hot keys for frequently used commands. Neither will we
present every aspect of the program here in print. All the programs are included on the CD; you can explore them using the book as a guide. In the text,
we will stress the structure of the program and its most important features.
The options the program will offer its user are:
Render the object in windowed or full screen mode. You may remember that to render a full screen using OpenGL, you first make a window that has the same dimensions as the desktop. A child window will
be used to receive the OpenGL-rendered output, and when it is created, the decision can be made as to whether it will be full-screen or
windowed.
Change the color of the background. This is simply done by changing
the background color during initialization.
Slowly rotate the model. A Windows timer going off every 10 ms or so
will change the angular orientation of the model.
Allow the user to rotate the model in front of her by using the mouse.
If the mesh makes use of image maps, display them. Try to use the
map in a manner close to the way it would appear when rendered by
the CAD or animation package which originated the mesh. (The fixed-functionality pipeline limits the way an image may be repeated; mosaic
tiling is not normally available, for example.)
Offer the best possible rendering acceleration by using display lists and
where possible polygon stripping (described in Section 13.1.5).
In the following sections, we will look at the strategy used to implement
the application by concentrating on the most important statements in the
code.
13.1.4 Setting Up
A main window, with menu bar, is created in the standard manner; that is,
in the manner described in Listings D.2 and D.3. Global variables (see Listing 13.3) are declared and initialized. The application-defined structures for
the vertices, polygons and image maps which describe the mesh models (see
Section 5.1) are shown in Listing 13.4. Code for the functions that parse the
file for vertex, polygon and image map data and build lists of these key data
in global memory buffers will not be discussed further in the text, because
VERTEX  *MainVp = NULL;            // list of mesh vertices
FACE    *MainFp = NULL;            // list of mesh polygons (faces)
MAP     *MapsFp = NULL;            // list of image maps
long     Nvert = 0, Nface = 0,     // number of vertices and faces
         NvertGlue = 0, Nmap = 0;  // ... and number of image maps
// (the remaining int, BOOL and GLfloat globals, the user options and the
//  viewpoint parameters such as the rotation angles and scale, are
//  declared here; see the project on the CD for their names)
Listing 13.3. The application's most significant global variables for the mesh data and
viewpoint.
// Application-defined data types for 3D vectors of different kinds
typedef struct {
  ...                                 // (the vertex members, such as its
} VERTEX;                             //  position p[3] and texture
                                      //  coordinates x and y, are elided)
typedef struct {
  ...                                 // (the vertex-index list V[3] and
                                      //  other members are elided)
  unsigned char color[3], matcol[3],  // surface and material colors
                texture, map, axis;   // texturing flags and map identifiers
} FACE;
typedef struct {
  char filename[128],                 // the image file for the map
       frefl[128], fbump[128];        // reflection and bump-map image files
  ...                                 // (other members elided)
} MAP;
Listing 13.4. Data structures for the 3D entities to be rendered. The functions which
read the OpenFX and 3ds Max 3DS data files convert the information into arrays of
vertices, triangular polygons and image maps in this format.
it would take up too much space. If you examine the project on the CD,
you will see that the code can handle not only OpenFX mesh models but
also the commercial and commonly used 3DS mesh format. This brings up
the interesting point of how different applications handle 3D data and how
compatible they are with the OpenGL model.
The OpenFX and 3DS file formats are typical of most 3D computer
animation application programs: they use planar polygons, with three or four
case WM_TIMER:
  if(IsWindow(hWndDisplay)){
    round_angle += 0.5;        // advance the model's rotation angle and
    ...                        // force the OpenGL window to repaint
  }
  break;
...                            // (WM_COMMAND messages from the menu are
                               //  handled here)
Listing 13.5. Message processing of commands from the main window's menu.
vertices at most. Many 3D CAD software packages, however, use some form
of parametric surface description, NURBS or other type of surface patch.
OpenGL provides a few NURBS rendering functions, but these are part of the
auxiliary library and are unlikely to be implemented in the GPU hardware,
so they tend to respond very slowly.
Even within applications which concentrate on simple triangular or quadrilateral polygons, there can be quite a bit of difference in their data structures.
OpenFX, for example, assigns attributes on a per-polygon basis, whereas 3DS
uses a materials approach, only assigning a material identifier to each polygon, and thus whole objects or groups of polygons have the same attributes.
OpenGL assumes that vertices carry the attributes of color etc., and so in the
case of rendering polygons from an OpenFX mesh, many calls to the functions that change the color state variable are required. Another difference
between OpenFX and 3DS concerns mapping. Fortunately, both packages
support mapping coordinates, and this is compatible with the way OpenGL
handles image mapping.
In the basic fixed-functionality pipeline of OpenGL, the lighting and
shading models are somewhat limited when compared to the near-photorealistic
rendering capability of packages such as 3ds Max, and some compromises
may have to be accepted.
Often, hardware acceleration poses some constraints on the way in which
the data is delivered to the hardware. With OpenGL, it is much more efficient to render triangle strips than individual triangles. However, often triangle meshes, especially those that arise by scanning physical models, are not
recorded in any sort of order, and so it can be well worth preprocessing or
reprocessing the mesh to try to build it up from as many triangle strips as
possible.
Most 3D modeling packages use a right-handed coordinate system in
which the z-axis is vertical. OpenGL's right-handed coordinate system has
its y-axis vertical, and the default direction of view is from (0, 0, 0) looking
along the −z direction. With any mesh model, we must be careful to remember this and accommodate the scales involved, too. In our 3D application
examples, we will always scale and change the data units so that the model
is centered at (0, 0, 0), the vertex coordinates conform to OpenGL's axes and
the largest dimension falls in the range [−1, +1].
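A small helper along the following lines (a sketch of our own, not a listing from the CD) performs that centering, rescaling and axis swap; c[] is assumed to hold the center of the model's bounding box and half_size half of its largest dimension:
void ToOpenGLCoords(double p[3], double c[3], double half_size, GLfloat out[3]){
  GLfloat s = (GLfloat)(1.0/half_size);   // uniform scale into [-1, +1]
  out[0] =  (GLfloat)(p[0]-c[0])*s;       // x is unchanged
  out[1] =  (GLfloat)(p[2]-c[2])*s;       // the modeler's vertical z becomes y
  out[2] = -(GLfloat)(p[1]-c[1])*s;       // and y becomes -z
}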
The main window's message-handling function processes a number of messages. Listing 13.5 singles out the WM_TIMER message, which is used to increment the angular rotation of the model every few milliseconds, and
WM_COMMAND to handle commands from the user menu.
  ...                                // (the start of the function, which tests a
                                     //  global variable to choose windowed or
                                     //  full-screen mode and begins a
                                     //  CreateWindowEx(..) call for a basic
                                     //  window, is not reproduced here)
    WS_POPUP | WS_CAPTION | WS_SYSMENU | WS_THICKFRAME, // standard decoration
    64,64, 64+512, 64+512,           // screen position and size
    hWndParent,NULL,hInstance,NULL);
}
else{
  hwnd=CreateWindowEx(
    0,
    "FSCREEN",                       // class name (a macro with text)
    "FSCREEN",
    WS_POPUP,
    0, 0,                            // the size of
    GetSystemMetrics(SM_CXSCREEN),   // the display
    GetSystemMetrics(SM_CYSCREEN),   // screen
    hWndParent, NULL, hInstance, NULL );
}
ShowWindow(hwnd,SW_SHOW);
InvalidateRect(hwnd,NULL,FALSE);
return hwnd;
}
Listing 13.6. Creating the OpenGL display window in either full-screen or windowed
mode.
Listing 13.7. The OpenGL window's message-handling function. The WM_CREATE
message handles the standard stuff from Listing 13.1 and calls the function
initializeGL(..) to handle application-specific startup code.
Listing 13.8. Handling mouse messages allows users to change their viewpoint of the
mesh model.
GLubyte checkImage[checkImageWidth][checkImageHeight][3];
void makeCheckeredImageMap(void){
// This function uses the code from the OpenGL reference
// manual to make a small image map 64 x 64 with a checkerboard
// pattern. The map will be used as a drop-in replacement
// for any maps that the mesh model uses but that can't be created
// because, for example, the image file is missing.
...
// The KEY OpenGL functions that are used to make a texture map follow
glBindTexture(GL_TEXTURE_2D,CHECK_TEXTURE);
// define a 2D (surface) map
// this loads the texture into the hardware's texture RAM
glTexImage2D(GL_TEXTURE_2D, 0, 3, checkImageWidth,
checkImageHeight, 0, GL_RGB, GL_UNSIGNED_BYTE,
&checkImage[0][0][0]);
// very many parameters can be applied to image maps e.g.
glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_REPEAT);
.. // consult the reference documentation for specific detail
glHint(GL_PERSPECTIVE_CORRECTION_HINT,GL_NICEST);
... //
}
Listing 13.9. Additional OpenGL configuration for lighting, a default image map,
viewport and those camera properties (including its location at the origin, looking
along the −z-axis) that don't change unless the user resizes the window.
When the user presses a mouse button over the display window, the mouse is captured for exclusive use by this window (even if it moves outside the
window) and its screen position is recorded. As the mouse moves, the difference
from its previous position is used to adjust the orientation of the mesh by rotating
it about a horizontal and/or vertical axis. The right mouse button uses a
similar procedure to change the horizontal and vertical position of the model
relative to the viewpoint.
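Since the body of Listing 13.8 is not reproduced here, the sketch below shows the usual shape of such mouse handling; last_x and last_y are illustrative variable names, while round_angle and up_angle are the view angles used in the rendering code.
case WM_LBUTTONDOWN:
  SetCapture(hWnd);                       // grab the mouse exclusively
  last_x = LOWORD(lParam); last_y = HIWORD(lParam);
  return 0;
case WM_MOUSEMOVE:
  if(GetCapture() == hWnd){
    round_angle += (GLfloat)((int)LOWORD(lParam) - last_x);  // spin
    up_angle    += (GLfloat)((int)HIWORD(lParam) - last_y);  // tilt
    last_x = LOWORD(lParam); last_y = HIWORD(lParam);
    InvalidateRect(hWnd, NULL, FALSE);    // redraw with the new view
  }
  return 0;
case WM_LBUTTONUP:
  ReleaseCapture();                       // give the mouse back
  return 0;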
The OpenGL initialization, which uses the standard set-up shown in Listings 13.1 and 13.2, is extended by the application-specific function initializeGL(..); some of this additional configuration appears in Listing 13.9.
void FreeObject(void){
///// Free up all the memory buffers used to hold
...// the mesh detail and reset pointers
MainVp=NULL; MainFp=NULL; MapsFp=NULL;
Nvert=0; Nface=0; NvertGlue=0; Nmap=0;
}
void LoadModel(char *CurrentDirFile, int fid){
// load the mesh
FreeObject(); // make sure existing mesh is gone
if(fid == FI_3DS)Load3dsObject(CurrentDirFile); // select loading function
else
LoadOfxObject(CurrentDirFile); // by filename extension
ListAdjFaces();
// find a list of all polygons which are adjacent
// to each other
FixUpOrientation();
// make sure all the faces have consistent normals
StripFaces();
// order the polygons into strips if possible
}
int LoadOfxObject(char *FileName){ // load the OpenFX model - the file
// is segmented
int fail=1;
// into chunks
long CHUNKsize,FORMsize;
if( (fp=fopen(FileName,"rb")) == NULL)return 0;
// check headers to see if this is an OpenFX file
if (fread(str,1,4,fp) != 4 || strcmp(str,"FORM") != 0){ fail=0; goto end; }
if (getlon(&FORMsize) != 1) goto end;
if (fread(str,1,4,fp) != 4 || strcmp(str,"OFXM") != 0){ fail=0; goto end; }
loop: // we have an OpenFX file so go get the chunks
if (fread(str,1,4,fp) != 4 ) goto end;
if (getlon(&CHUNKsize) != 1) goto end;
if      (strcmp(str,"VERT") == 0){ReadVertices(CHUNKsize);}   // read the vertices
else if (strcmp(str,"FACE") == 0)ReadOldFaces(CHUNKsize,0);   // read the polygons
else if ... // read other required chunks
else getchunk(CHUNKsize); // read any chunks not required
goto loop;
end:
fclose(fp);
return fail; // 1 = success 0 = failure
}
Listing 13.10. Loading the OpenFX and 3ds Max mesh models involves parsing the
data file. The files are organized into chunks. After reading the data, it is processed
to make sure that all the polygons have a consistent surface normal. The function
StripFaces() provides a rudimentary attempt to optimize the order of the triangular polygons so that they are amenable to OpenGLs acceleration.
13.1.5
If you want to see all the detail of how the vertex, polygon and image map
information is extracted from the 3D data files, consult the actual source code.
Listing 13.10 gives a brief summary. Both the OpenFX and 3DS mesh files
Figure 13.5. OpenGL provides a mechanism for sending strips of triangles to the graphics pipeline.
Figure 13.6. The stripping algorithm tries to build strips of adjacent triangular poly-
gons. This helps to reduce the number of vertices that have to be rendered by
OpenGL. For example, if you can make a strip from 100 adjacent polygons, you
reduce the number of glVertex..() calls from 300 to 201.
use a storage scheme based on chunks. There is a chunk to store the vertex
data, one to store the triangle face/polygon data and numerous others. Each
chunk comes with a header (OpenFX has a four-character ID and 3DS a
two-byte magic number) and a four-byte length. Any program parsing this
file can jump over chunks looking for those it needs by moving the file pointer
forward by the size of a chunk.
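A sketch of that chunk-skipping idea (the real getchunk(..) called in Listing 13.10 may do a little more than this) is simply:
void getchunk(long CHUNKsize){     // skip a chunk that is not needed by
  fseek(fp, CHUNKsize, SEEK_CUR);  // moving the file pointer (fp is the
}                                  // open mesh file) forward by its length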
A fairly simplistic algorithm which attempts to locate strips of triangles
such as those in the mesh illustrated in Figure 13.5 is provided by function
StripFaces(). The strategy of the algorithm is summarized in Figure 13.6.
The algorithm locates groups of polygons that can form strips and sets the
member variable next in each polygon's data structure so that it behaves like
a linked list to vector a route through the array of polygons stored in RAM
at an address pointed to by MainFp. By following this trail, the polygons can
be rendered as a triangle strip, thus taking advantage of OpenGLs rendering
acceleration for such structures.
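A sketch of how that trail might be followed at render time is shown below. The member name next and the helper SendVertex(..), which would issue the glNormal..()/glVertex..() pair for one vertex index, are assumptions based on the description above, as is the convention that StripFaces() leaves each triangle's new vertex in V[2].
void RenderOneStrip(long first){
  long f = first;
  glBegin(GL_TRIANGLE_STRIP);
  SendVertex(MainFp[f].V[0]);              // the first triangle contributes
  SendVertex(MainFp[f].V[1]);              // all three of its vertices
  SendVertex(MainFp[f].V[2]);
  while((f = MainFp[f].next) >= 0)         // each following triangle in the
    SendVertex(MainFp[f].V[2]);            // strip adds just one new vertex
  glEnd();
}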
13.1.6
We shall discuss rendering with the use of five listings so as to highlight the
differences which occur when one uses OpenGL to render simply-shaded
polygons, polygons that make up the whole or part of a smooth, rather than
faceted, object. We shall also introduce the display list, which allows further
optimization of the rendering process.
Display lists are very useful because they allow us to prepare a kind of
ready-made scene, one that can be quickly configured and rendered without
having to load all the vertices and images from the host computer's RAM.
For a VR application, a display list can be used to store all the OpenGL
commands to render a mesh model, independent of taking any particular
view. The programs user can then view the object from any angle by calling
the display list after position and rotation transformations are set up. Using a
display list works in this case because the mesh does not change over time. If the
mesh were deforming or moving, the list would have to be regenerated after each
change to the mesh.
The function which is called to render a particular view or in response
to a WM_PAINT message appears in Listing 13.11. Note that a translation
of 5.0 is applied along the negative z-axis; this moves the model away from
the camera so that it all is visible, i.e., we see the outside. If the model were
of a virtual environment, we would wish to keep it centered at (0, 0, 0) and
glShadeModel(GL_SMOOTH);                      // use smooth (Gouraud) shading
glPushMatrix();                               // save the current model-view
                                              // transformation
glLoadIdentity();                             // and start from the identity
glTranslatef(xpos_offset, ypos_offset, -5.0); // move the model away from the camera
glRotatef(up_angle,1.0,0.0,0.0);              // tilt it up/down
glRotatef(round_angle+180.0,0.0,1.0,0.0);     // spin it about the vertical axis
glScalef(view_scale, view_scale, view_scale); // apply the user's zoom factor
if(listID > 0){                               // if a display list has been built
  glEnable(GL_LIGHTING);                      // turn on lighting,
  glCallList(listID);                         // draw the whole mesh from the list
                                              // and then
  glDisable(GL_LIGHTING);                     // turn lighting off again
}
glPopMatrix();                                // restore the saved model-view
                                              // transformation
glFlush();                                    // make sure all the commands
                                              // have been executed
expand it so that we can move about inside it. We should, however, still make
sure that it all lies on the camera side of the rear clipping plane.
Compiling a display list is not a complex process because the normal
rendering commands are issued in exactly the same way that they would be
when rendering directly to the frame buffer:
..
glNewList(listID,GL_COMPILE);   // start to build a display list
glBegin(GL_TRIANGLES);
  ..                            // render triangular polygons by calling
                                // glVertex..() three times per polygon
glEnd();
glEndList();                    // end the display list
..
To build a display list for the mesh, we must loop over every face/polygon
and call glVertex..() for each of its three vertices. To achieve the correct
lighting, a surface normal will also have to be defined at each vertex. When a change of color is detected, a new set of material properties must also be written into the list, as the caption of Listing 13.12 explains.
Listing 13.12. A display list holds all the commands to render the polygons in a mesh
model. Changes of color apply to all subsequent vertices added to the list, and so
each time a change of color is detected, a new set of material property specifications
has to be added to the display list.
  glEnd();
  if(Mapped){
    ..... // Render any polygons with image maps - see next listing
  }
  glEndList();
  free(nv);
}
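The opening of Listing 13.12 is not reproduced above. Under the data structures of Listing 13.4, the per-face loop it performs might be sketched as follows; the color test is illustrative, the variable declarations are omitted as in the book's other listings, and the vertex handling mirrors Listing 13.13.
glNewList(listID, GL_COMPILE);
glBegin(GL_TRIANGLES);
fp = MainFp;
for(i = 0; i < Nface; i++, fp++){
  if(i == 0 || memcmp(fp->color, (fp-1)->color, 3) != 0){
    // the color differs from the previous polygon's, so put a new material
    // specification into the list (glMaterial may legally appear between
    // glBegin and glEnd)
    GLfloat col[4] = {fp->color[0]/255.0f, fp->color[1]/255.0f,
                      fp->color[2]/255.0f, 1.0f};
    glMaterialfv(GL_FRONT_AND_BACK, GL_AMBIENT_AND_DIFFUSE, col);
  }
  v0 = MainVp + fp->V[0]; v1 = MainVp + fp->V[1]; v2 = MainVp + fp->V[2];
  if(Normalize(v0->p, v1->p, v2->p, n)){       // the polygon's surface normal
    glNormal3f(n[0], n[1], n[2]);
    for(j = 0; j < 3; j++){
      v = MainVp + fp->V[j];
      x = ((GLfloat)(v->p[0]-c[0]))*scale;     // center and scale the model
      y = ((GLfloat)(v->p[1]-c[1]))*scale;     // and swap to OpenGL's axes,
      z = ((GLfloat)(v->p[2]-c[2]))*scale;     // exactly as in Listing 13.13
      glVertex3f(x, z, -y);
    }
  }
}
glEnd();
glEndList();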
13.1.7 Image Mapping
Listing 13.13. Rendering image mapped polygons by looping over the image maps
and polygons. When a polygon is textured with a map, the appropriate texture
coordinates are calculated and the vertices passed to OpenGL.
7. The terms image map, texture and shader are often loosely used to refer to the same thing,
because they all result in adding the appearance of detail to a polygonal surface without that
detail being present in the underlying mesh and geometry. Whilst we would prefer to reserve
the words shader and texture for algorithmically generated surface detail, we will, when talking
generally, use them interchangeably.
else {                                       // an ordinary map painted onto the surface
  Image=LoadMAP(MapsFp[k].filename,&xm,&ym);
  refmap=FALSE;
}
if(Image != NULL)makeNextMap(Image,xm,ym); // If map pixels were created
// make the OGL map in
// texture RAM.
glEnable(GL_TEXTURE_2D);
// render from a map instead of
// a color
// Tell OGL which type of texture to use and which image map to use.
// If we could
// not load the map we tell OGL to use the simple checker board texture
if(Image != NULL)glBindTexture(GL_TEXTURE_2D,IMAGE_TEXTURE);
else
glBindTexture(GL_TEXTURE_2D,CHECK_TEXTURE);
// For OpenFX models we need to get its map parameters so that
// we can generate
// texture coordinates.
GetMapNormal(MapsFp[k].map,MapsFp[k].p,MapsFp[k].x,MapsFp[k].y,mP,mX,mY,mN);
glBegin(GL_TRIANGLES);
// now render the polygons
fp=MainFp; for(i=0;i<Nface;i++,fp++){
if((fp->texture & 0x40)!=0x40)continue; // skip if polygon not mapped
brushID=(fp->brush & 0x1f);
// which image ?
if(brushID != k)continue;
// skip if not this image map
v0=(MainVp+fp->V[0]); v1=(MainVp+fp->V[1]); v2=(MainVp+fp->V[2]);
if(Normalize(v0->p,v1->p,v2->p,n)){
// get polygon normal
for(j=0;j<3;j++){
Vi=fp->V[j]; v=(MainVp+Vi);
x=((GLfloat)(v->p[0]-c[0]))*scale; // scale and center model
.. ;                                // same for the y and z coords
                                    // (keeps the model in the view volume)
if(smooth_shading && ((fp->texture >> 7) != 0)){ // smoothed
glNormal3f(nv[Vi][0], nv[Vi][1],nv[Vi][2]);
} else glNormal3f( n[0], n[1], n[2]);
// not smoothed
if(MapsFp[k].map == MAP_BY_VERTEX){
// texture coords exist - use them!!
alpha=(GLfloat)v->x; beta=(GLfloat)v->y;
}
else if(MapsFp[k].map == CYLINDER || // OpenFX model - must
// calculate texture coords
MapsFp[k].map == CYLINDER_MOZIAC)
GetMappingCoordC(mN,mX,mY,mP,MapsFp[k].angle,
v->p,&alpha,&beta);
else GetMappingCoordP(mN,mX,mY,mP,v->p,&alpha,&beta);
if(!refmap)glTexCoord2f(alpha,beta);
glVertex3f(x,z,-y);
} } }
glEnd();
glDisable(GL_TEXTURE_2D);
if(refmap){
glDisable(GL_TEXTURE_GEN_S);
glDisable(GL_TEXTURE_GEN_T);
}
}
} // end of rendering image maps
Image mapping gives the real-time renderer much of the visual richness of the non-real-time rendering engines. There are few effects that cannot be
obtained in real time using image mapping.
In the example we are working on, we shall not deploy the
full spectrum of compound mapping where several maps can be
mixed and applied at the same time to the same polygons. Neither will we use some of the tricks of acceleration, such as loading
all the pixels for several maps as a single image. In many cases,
these tricks are useful, because the fixed-functionality pipeline
imposes some constraints on the way image maps are used in the
rendering process.
Listing 13.13 shows how the parts of the mesh model which have textured
polygons are passed to OpenGL. The strategy adopted is to run through all
the image maps used by the mesh model (there could be a very large number
of these). For each map (map i), its image is loaded and the pixels are passed
to OpenGL and loaded into texture RAM. After that, all the polygons are
checked to see whether they use map i. If they do then the texture coordinates
are obtained. OpenGL requires per-vertex texture coordinates, and, therefore,
if we are loading an OpenFX model, we must generate the vertex texture
coordinates. 3ds Max already includes texture coordinates for each vertex,
and so these meshes can be passed directly to OpenGL.
The code that is necessary to process the image by loading it, decoding
it and making the OpenGL texture is shown in Listing 13.14. To round
off the discussion on image mapping, if you look back at Listing 13.13, you
will notice the calls to the functions glEnable(GL_TEXTURE_GEN_S) and
glEnable(GL_TEXTURE_GEN_T). If these parameters are enabled, OpenGL
will ignore any texture coordinates specified for vertices in the model and
instead generate its own based on the assumption that what one sees on the
mesh surface is a reflection, as if the image were a photograph of the surrounding environment. This is the way OpenGL simulates reflective/mirrored surfaces and the extremely useful application of environment maps as discussed
in Section 7.6.
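For reference, the usual way to ask the fixed-functionality pipeline for this behavior (a sketch, not the book's exact code) is to select the sphere-map generation mode before enabling the two texture-generation switches:
glTexGeni(GL_S, GL_TEXTURE_GEN_MODE, GL_SPHERE_MAP);  // generate coordinates as
glTexGeni(GL_T, GL_TEXTURE_GEN_MODE, GL_SPHERE_MAP);  // if the map were a photo
glEnable(GL_TEXTURE_GEN_S);                           // of the surroundings
glEnable(GL_TEXTURE_GEN_T);
// ... render the reflective polygons (no glTexCoord2f calls are needed) ...
glDisable(GL_TEXTURE_GEN_S);
glDisable(GL_TEXTURE_GEN_T);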
This takes us about as far as we need to go with OpenGL for the moment.
We have seen how to render image-mapped polygons from any viewpoint and
either move them or move the viewpoint. We can use a good lighting model
and can draw objects or virtual worlds constructed from many thousands of
planar triangular or other polygonal primitives, in real time. It is possible to
extend and use this code, so we shall call on it as the basis for some of the
Listing 13.14. Creating the image map requires three steps: it must be loaded and
decompressed, then it has to be converted into a form that OpenGL can use (i.e.,
its height and width must be a power of two); then it can be passed to OpenGL and
loaded into the texture RAM on the display adapter.
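The body of Listing 13.14 is not reproduced in the text. A sketch of the three steps its caption describes, assuming the image has already been loaded and decoded into Image with dimensions xm x ym (names from Listing 13.13), could read:
GLubyte *scaled = (GLubyte *)malloc(256*256*3);        // a power-of-two buffer
gluScaleImage(GL_RGB, xm, ym, GL_UNSIGNED_BYTE, Image, // resample the image to
              256, 256, GL_UNSIGNED_BYTE, scaled);     // 256 x 256
glBindTexture(GL_TEXTURE_2D, IMAGE_TEXTURE);           // select the texture object
glTexImage2D(GL_TEXTURE_2D, 0, 3, 256, 256, 0,         // load the pixels into the
             GL_RGB, GL_UNSIGNED_BYTE, scaled);        // adapter's texture RAM
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
free(scaled);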
projects in Chapter 18. We shall now turn to look briefly at how OpenGL
can be used in a system-independent manner.
13.2
The OpenGL Utility Toolkit [3] is a useful library of functions that completely
removes the system-dependent element from OpenGL application programs. It is available in binary form or as source code for UNIX, Windows and Macintosh platforms. Using it, programs can be put together very
.. // standard headers
#include <GLUT/glut.h>
.. // other headers, program variables and function prototypes
// (all standard C data types! )
void Render() {                 // Called by GLUT to render the 3D scene
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
..
// set up view transformations
glCallList(listID);
..
glFinish();
glutSwapBuffers();
// GLUT function
}
Listing 13.15. The outline program in this listing shows a version of the mesh viewer developed in
Section 13.1 which uses the GLUT. It was developed for Mac OS X using the
Xcode tools.
quickly. If you are writing a program and do not need to achieve a specific
look or feel or tie into some system-dependent functionality, the GLUT is the
obvious way to use OpenGL.
The GLUT has its own programming model which is not that dissimilar from the Windows approach. It uses a window of some kind to display
the contents of the frame buffer. Initialization functions create this window
and pass to the GLUT library a number of application-specific callbacks8 to
support mouse and keyboard input. The application programmer must supply the appropriate OpenGL drawing and other commands in these callback
functions. Listing 13.15 shows the key elements of a version of our mesh
model viewer using the GLUT and running under the Macintosh OS X operating system. The application would be equally at home on Windows or an
SGI UNIX system. It is a good illustration of just how platform-independent
OpenGL can be.
The example given in this section only demonstrates a few key features
of the utility library. Menu commands, mouse clicks and multiple windows
can also be used. Unless one really must have some system-specific functions
in a 3D application, it is a good idea to use the GLUT or some other multi-platform framework so as to maintain as much host flexibility as possible.
8. A callback is a function, like a window message handler, that is part of the program but
is not called explicitly by the program itself; it is called by another program, another system component or even the
operating system. It is called indirectly or, to put it another way, it is called back, hence the name.
void Reshape(int w,int h){ // called by GLUT when window changes shape
glViewport(0, 0, w, h);
}
void Init(){ // initialize OpenGL - called after window is created
..
glClearColor(gbRed,gbGreen,gbBlue,1.0);
..
glEnable(GL_LIGHT0);
}
void KeyHandler(unsigned char key,int x,int y){ // called by GLUT
  switch(key){
    case 'l':                      // user pressed the 'l' key to
                                   // load a new mesh
      LoadOFXmesh(GetFileName());  // or 3DS mesh
      Make3dListS();               // make the OpenGL display list
      break;
  };
}
void TimerFunc(int value){ // called by GLUT
round_angle += 5.0; up_angle += 1.0; Render(); // change view and render
glutTimerFunc(20,TimerFunc,0);
// run again in 20 ms
}
int main(int argc,char **argv){                          // main program entry point
  glutInit(&argc,argv);
  glutInitWindowSize(640,480);                           // set up window (size given)
  glutInitWindowPosition(100,100);                       // position on screen
  glutInitDisplayMode(GLUT_RGBA|GLUT_DOUBLE|GLUT_DEPTH);
  glutCreateWindow("Mesh Viewer for Mac OS X");
  glutDisplayFunc(Render);                               // Tell GLUT the function to use
                                                         // to render the scene
  glutReshapeFunc(Reshape);                              // Tell GLUT the function to use
                                                         // for window size changes
  glutKeyboardFunc(KeyHandler);                          // Tell GLUT the function to handle
                                                         // keyboard keys
  glutTimerFunc(100,TimerFunc,0);                        // Tell GLUT to run this function
                                                         // in 100 ms time
  Init();                                                // initialize our OpenGL requirements
  glutMainLoop();                                        // Wait in a loop and obey commands
                                                         // till the window is closed
  return 0;
}
13.3
Direct3D sets out to achieve the same result as OpenGL, but it does not have
such a long history and it is evolving at a much faster rate. For this reason,
we will not use it as our main framework for developing the visualization
examples. We will use the other components of DirectX, i.e., DirectInput
and DirectShow, for all other software/hardware interfacing because they work
well and there are very few alternatives for Windows applications.
13.3.1 A Bit of History
DirectX [4] first appeared under the guise of the Games SDK for Windows 95
in late 1995, and it has been refined and updated many times. At first, it only
gave the ability to draw directly to the display hardware, and for the graphics
programmer that was a breath of fresh air. This is the DirectDraw component of DirectX. It offered drawing speeds as fast as could be obtained if one
accessed the display adapter without using operating system mediation, say
by using assembler code. DirectDraw was designed with 2D computer games
in mind, so it had excellent support for blits (bit-block memory transfers), including transparent and key-color blits. Many operations are performed
asynchronously, with page flipping and color fills done in the display adapter's
hardware.
All that the DirectDraw API provides is fast access to a drawing canvas,
the equivalent of the output frame buffer. To render virtual worlds or anything else in 3D, one still has to render into the frame buffer, and this is
the function of Direct3D. The first few versions of Direct3D appeared in the
1990s. Back then, hardware support for 3D functions was still very limited
and so Direct3D had two modes of operation, called retained mode and immediate mode. Retained mode [2] implemented most of the 3D functionality
in library code running on the host's own processor, not on the GPU. Immediate mode was limited to a few simple functions and only managed to provide
some very basic lighting models, for example. Its functions matched the capabilities of the hardware rather than the desires of the application programs.
Now everything has changed. Retained mode has disappeared,9 and immediate mode has evolved to map onto the powerful features provided in hardware.
9. Not quite, because it uses COM architecture and so is still in there, inside the Direct3D
system. It is just not documented.
Figure 13.7. Two examples of shader development systems. On the left, a basic shader
development program is illustrated. It presents the C-like shader code that Microsoft
calls the High Level Shader Language; NVIDIA's similar version is Cg. On the right is
a screen from ATI's RenderMonkey program, which can produce fabulously exciting
shader programs.
13.3.2
Figure 13.8. Direct3D system overview. Note the programmable branch of the graphics pipeline. Primitive data refers to bitmaps and any other non-vertex input.
If a function cannot be performed by the hardware, the HAL does not report
it as a hardware capability. In DirectX 9, the HAL can take advantage of the
hardware's ability to do all the vertex processing, or it can fall back to software
processing. It's up to the display manufacturer to decide.
It is difficult to say whether Direct3D is a better environment for doing
3D graphics than OpenGL. Certainly, the architects of OpenGL had the luxury of designing their system for top-of-the-range workstations, whilst those
putting Direct3D together had to be content with the vagaries of an operating system that did not have a good reputation for any sort of graphics at all.
Since both systems rely so heavily on the hardware of the graphics cards, one
should look to the developers of 3D chipsets and their drivers if either Direct3D or OpenGL appears to differ markedly in speed or performance.
13.3.3
Figure 13.10. The geometry, coordinate systems and ordering of polygon vertices for
a cube with sides of length 2. The position of the point labeled 0 differs between the
OGL and D3D coordinate systems only in the sign of one of its coordinates.
Another difference is the approach taken when rendering in a full-screen
mode. Since D3D is a Microsoft creation, it can hook directly into the operating system and render directly to the screen without using the normal
niceties of giving way to other applications. As a result, trying to mix D3D
full-screen output with other windows does not work awfully well. In fact, for
some games that use D3D, the display may be switched into an entirely different screen resolution and color depth.10 On the other hand, OGL obtains
its full-screen performance by drawing into a conventional window devoid of
decoration (border, menu etc.) and sized so that its client area exactly matches
the size of the desktop.
If one looks at the sample programs that ship with the D3D SDK, at first
sight it can be quite frightening because there seems to be a massive quantity of code dedicated to building even the simplest of programs. However,
things are not quite as complex as they seem, because most of this code is
common to every example.
10. You can sometimes detect this on a CRT display by hearing the line/frame scan oscillator
changing frequency.
13.3.4
#include <d3dx9.h>                         // Header file from SDK
// These are the pointers to the D3D objects.
LPDIRECT3D9       g_pD3D        = NULL;    // Used to create the D3DDevice
LPDIRECT3DDEVICE9 g_pd3dDevice  = NULL;    // Our rendering device
...
// Set up D3D when the window is created
case WM_CREATE:
  // Create the D3D object.
  g_pD3D = Direct3DCreate9( D3D_SDK_VERSION );
  // Set up the structure used to create the D3DDevice.
  D3DPRESENT_PARAMETERS d3dpp;
  ZeroMemory( &d3dpp, sizeof(d3dpp) );
  d3dpp.Windowed = TRUE;
  d3dpp.SwapEffect = D3DSWAPEFFECT_DISCARD;
  d3dpp.BackBufferFormat = D3DFMT_UNKNOWN;
  d3dpp.EnableAutoDepthStencil = TRUE;
  d3dpp.AutoDepthStencilFormat = D3DFMT_D16;
  // Create the D3DDevice
  g_pD3D->CreateDevice( D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL, hWnd,
                        D3DCREATE_SOFTWARE_VERTEXPROCESSING,
                        &d3dpp, &g_pd3dDevice );
  // Enable the Z buffer
  g_pd3dDevice->SetRenderState( D3DRS_ZENABLE, TRUE );
  return 0;
case WM_DESTROY:
  // Release the D3D objects
  if( g_pd3dDevice != NULL )g_pd3dDevice->Release();
  if( g_pD3D != NULL )g_pD3D->Release();
  return 0;
Listing 13.16. Message handlers to create and destroy a window give convenient places
to initialize and release the Direct3D rendering system. Many applications also put
these steps in the WinMain() function on either side of the message loop.
To render the 3D scene with its geometry and image maps, we use a
function called from within the message loop.
Despite the significant difference in program style and implementation
detail between Direct3D and OpenGL, they both start with the same 3D
data and deliver the same output. Along the way, they both do more or less
the same thing: load a mesh, move it into position, set up a viewpoint, apply a
lighting and shading model and enhance the visual appearance of the surface
using image maps. Consequently, we can relatively easily rewrite any 3D
virtual world visualizer using the Direct3D API instead of the OpenGL one.
Quite a considerable fraction of the code can be reused without modification.
Loading the mesh models, establishing the Windows window and handling
VOID RenderScene(void){
// Clear the backbuffer and the zbuffer
g_pd3dDevice->Clear( 0, NULL, D3DCLEAR_TARGET|D3DCLEAR_ZBUFFER,
D3DCOLOR_XRGB(0,0,255), 1.0f, 0 );
// Begin the scene
if( SUCCEEDED( g_pd3dDevice->BeginScene() ) )
{
// Setup the lights and materials
// and viewing matrices to describe the scene
SetupLightsMatricesMaterials();
// USER FUNCTION
// Render the vertex buffer contents NOTE: g_pVB is a pointer to
// an object describing the scene geometry which must be loaded
// before rendering begins
g_pd3dDevice->SetStreamSource( 0, g_pVB, 0, sizeof(CUSTOMVERTEX) );
g_pd3dDevice->SetFVF( D3DFVF_XYZ|D3DFVF_NORMAL );
g_pd3dDevice->DrawPrimitive( D3DPT_TRIANGLELIST, 0, 12);
// End the scene
g_pd3dDevice->EndScene();
}
// Carry out double buffering swap
g_pd3dDevice->Present( NULL, NULL, NULL, NULL );
}
Listing 13.17. Render the scene by attaching the vertex objects to the D3D device and drawing their primitives.
the user commands require very little change. We must remember to switch
to the left-handed coordinate system that Direct3D uses. Direct3D has no
equivalent of the display list, so rendering will have to be done explicitly every
time the window has to be refreshed.
We have already seen that in OpenGL, if we allow the color or map to
differ from one polygon to the next, rendering will slow down and it will
interfere with hardware optimization. In Direct3D, it is even more inefficient.
Consequently, we will make the assumption that our mesh model contains
only a few colors and image maps. This assumption requires us to carry out a
post-load process where we break up the model into sets of polygons. There
will be a set for every image map and a set for every color used. During the
post-load process, we will also generate texture coordinates and load the image
maps. When it comes time to render the mesh, each set of polygons will be
considered in turn and appropriate attributes or maps applied as necessary.
Now we turn to look at some of the key implementation details. However, we will omit code that performs similar tasks to that discussed in the
OpenGL examples. Using the listings given here, you should be able to follow the logic of the full sources and modify/extend them for your own needs.
struct CUSTOMVERTEX_0 {              // vertex format for plain colored polygons
  D3DXVECTOR3 position;
  D3DXVECTOR3 normal;
};
struct CUSTOMVERTEX_1 {              // vertex format for image-mapped polygons
  D3DXVECTOR3 position;
  D3DXVECTOR3 normal;
  FLOAT       u, v;                  // texture coordinates
};
// These tell Direct3D what our custom vertex structures will contain
#define D3DFVF_CUSTOMVERTEX_0 (D3DFVF_XYZ|D3DFVF_NORMAL)
#define D3DFVF_CUSTOMVERTEX_1 (D3DFVF_XYZ|D3DFVF_NORMAL|D3DFVF_TEX1)
struct COLOR {unsigned char r,g,b;};
LPDIRECT3D9             g_pD3D       = NULL;    // used to create the D3DDevice
LPDIRECT3DDEVICE9       g_pd3dDevice = NULL;    // our rendering device
LPDIRECT3DVERTEXBUFFER9 g_pMesh[32];            // the vertex buffers: one per
                                                // color and one per texture map
LPDIRECT3DTEXTURE9      g_pMap[16];             // the Direct3D texture maps
long  g_MeshVertexCount[32];                    // number of vertices in each buffer
COLOR g_color[16];                              // the color used by each buffer
long  g_NumberOfMeshes=0,g_NumberOfTextures=0;  // how many of each are in use
Listing 13.18. The application uses two types of custom vertex, depending on whether
it is rendering part of a polygon mesh with a texture or a basic color. In this code,
the structures are defined and two global arrays are declared that will hold pointers
to the mesh objects and the texture maps.
13.3.5
The program begins by creating a Direct3D device (as in Listing 13.16) and
a child window to display the rendered view. In response to a user command
to load the model, all the same procedures applied in the OpenGL program
are repeated. In addition, vertex coordinates are switched to conform to Direct3D's requirements. Once this is done, using the structures defined in Listing 13.18, the post-load function of Listing 13.19 prepares an array of vertex
buffers and textures. We will limit the number of vertex buffers (holding
triangles with the same color and smoothing state) and textures to 16 each.
Separate vertex buffers are built from all the vertices associated with polygons
having the same properties, including the same color; similar buffers are built for the polygons carrying each image map.
void CreateMeshes(void){
// create the Direct3D meshes
.. // declare local variables
.. // make list of average surface normals as in OpenGL example
fp=MainFp; for(i=0;i<Nface;i++,fp++){ // count number of different colors
if(fp->mapped)continue;
// skip any polygons with image
// maps applied
if(!existing_color(fp->color)){
// check to see if color exists
g_color[g_NumberOfMeshes]=fp->color;// keep record of color
g_NumberOfMeshes++;                     // do not let this exceed 16
}
g_MeshVertexCount[g_NumberOfMeshes]+=3; // 3 more vertices
}
// using color as a guide make up lists of vertices
// that will all have same color
for(k=0;k<g_NumberOfMeshes;k++){
// make a vertex buffer for this little mesh
if( FAILED( g_pd3dDevice->CreateVertexBuffer(
g_MeshVertexCount[k]*sizeof(CUSTOMVERTEX_0),   // space for this mesh's vertices
0, D3DFVF_CUSTOMVERTEX_0, D3DPOOL_DEFAULT, &g_pVB[k], NULL)))
return E_FAIL;
CUSTOMVERTEX_0* pVertices;
g_pVB[k]->Lock(0, 0, (void**)&pVertices, 0 ); // get pointer to vertex buffer
vtx=0;
// vertex index into buffer
fp=MainFp; for(i=0;i<Nface;i++,fp++){ // loop over all polygons
if(fp->mapped)continue;
// skip any polygons with image
// maps applied
if(fp->color == g_color[k]){
// this face has the color add
// a triangle to mesh
v0=(MainVp+fp->V[0]); v1=(MainVp+fp->V[1]); v2=(MainVp+fp->V[2]);
if(Normalize(v0->p,v1->p,v2->p,n)){// calculate surface normal
for(j=0;j<3;j++){
Vi=fp->V[j]; v=(MainVp+Vi);
x=((GLfloat)(v->p[0]-c[0]))*scale; // Scale and center the model in
y=((GLfloat)(v->p[2]-c[2]))*scale; // the view volume - correct for
z=((GLfloat)(v->p[1]-c[1]))*scale; // different coordinate system
pVertices[vtx].position = D3DXVECTOR3(x,y,z);
pVertices[vtx].normal = D3DXVECTOR3(n[0],n[1],n[2]);
if(smooth_shading && ((fp->texture >> 7) != 0)) // smoothed
pVertices[vtx].normal = D3DXVECTOR3(nv[Vi][0],nv[Vi][1],nv[Vi][2]);
else
// not smoothed
pVertices[vtx].normal = D3DXVECTOR3(n[0],n[1],n[2]);
vtx++; // increment vertex count for this bit of mesh
}
}
}
}
g_pVB[k]->Unlock(); // ready to move on to next patch of colored vertices
}
// Now do the same job for polygons that have maps applied (next listing)
CreateMapMeshes();
}
Listing 13.19. Convert the vertex and polygon data extracted from the mesh model into Direct3D vertex buffers, one for each distinct color.
void CreateMapMeshes(){
Listing 13.20. Load the image maps into Direct3D textures and create vertex buffers
holding all the vertices associated with polygons carrying that texture.
pVertices[2*i+0].u = alpha;
pVertices[2*i+0].v = beta;
vtx++;
}
}
g_pVB[k+16]->Unlock(); // ready to move on to next texture
}
}
// all done
13.3.6
VOID SetupView()
{
  // Set the model in the correct state of rotation given by the user input.
  // (Each D3DXMatrixRotation/Translation call builds a complete matrix, so
  //  the individual transformations are composed with D3DXMatrixMultiply.)
  D3DXMATRIXA16 matWorld, matRot, matTrans;
  D3DXMatrixIdentity( &matWorld );
  D3DXMatrixRotationX( &matRot, up_angle );           // tilt up/down
  D3DXMatrixMultiply( &matWorld, &matWorld, &matRot );
  D3DXMatrixRotationY( &matRot, round_angle );        // rotation about the vertical
  D3DXMatrixMultiply( &matWorld, &matWorld, &matRot );
  D3DXMatrixTranslation( &matTrans,0.0f,0.0f,1.0f );  // move along the z axis
  D3DXMatrixMultiply( &matWorld, &matWorld, &matTrans );
  g_pd3dDevice->SetTransform( D3DTS_WORLD, &matWorld );
// look along the Z axis from the coordinate origin
D3DXVECTOR3 vEyePt( 0.0f, 0.0f,0.0f );
D3DXVECTOR3 vLookatPt( 0.0f, 0.0f, 5.0f );
D3DXVECTOR3 vUpVec( 0.0f, 1.0f, 0.0f );
D3DXMATRIXA16 matView;
D3DXMatrixLookAtLH( &matView, &vEyePt, &vLookatPt, &vUpVec );
g_pd3dDevice->SetTransform( D3DTS_VIEW, &matView );
// Set a conventional perspective view with a 45 degree field of view
// and with clipping planes to match those used in the OpenGL example
D3DXMATRIXA16 matProj;
D3DXMatrixPerspectiveFovLH( &matProj, D3DX_PI/4, 0.5f, 0.5f, 10.0f );
g_pd3dDevice->SetTransform( D3DTS_PROJECTION, &matProj );
}
Listing 13.22. Lighting is set up every time a frame is rendered; the light is set to a fixed position relative to the viewpoint.
13.3.7 Lighting
13.3.8
Listing 13.16 forms the template of a suitable WinMain() function with a message-processing loop. Rendering is done by placing a call to Render() inside
the loop. The code for the Render() function is given in Listing 13.23. This
code is quite short, because all the difficult work was done in the post-load
stage, where a set of triangular meshes and textures were created from the
data in the file. To render these meshes, the procedure is straightforward. We
loop over all the meshes for the non-textured part of the model. A material
is created with the color value for these triangles, and the Direct3D device is
instructed to use it. Once the non-textured triangles have been rendered, we
turn our attention to each of the textures. Again, we loop through the list of
textures, rendering the triangular primitives for each in turn. The code looks
quite involved here, talking about texture stages, but that is only because Direct3D has an extremely powerful texturing engine that can mix, blend and
overlay images to deliver a wonderful array of texturing effects.
Finally, add a menu and some initialization code. Just before the program
terminates, the COM objects are freed, along with any memory buffers or
other assigned resources. Once running, the program should fulfill the same
task as the OpenGL example in Section 13.1.
13.4 Summary
It has been the aim of this chapter to show how real-time rendering software
APIs are used in practice. We have tried to show that to get the best out of the
hardware, one should appreciate in general terms how it works and how the
3D APIs match the hardware to achieve optimal performance. We have omitted from the code listings much detail where it did not directly have a bearing
on 3D real-time rendering, e.g., error checking. Error checking is of course
vital, and it is included in all the full project listings on the accompanying
CD. We have also structured our code to make it as adaptable and generic as
possible. You should therefore be able to embed the 3D mesh model viewer
in container applications that provide toolbars, dialog box controls and other
Listing 13.23. Rendering the scene involves establishing a viewpoint and lighting conditions and then drawing the triangular primitives from one of two lists: a list for sets
of colored triangles and a list for sets of textured triangles.
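The body of Listing 13.23 is not reproduced here. Ignoring the texture-stage configuration mentioned above, a minimal sketch of its two loops, using the globals of Listing 13.18 and the convention from Listing 13.20 that textured vertex buffers occupy slots 16 and above, might look like this:
long k;
// Draw the plain-colored parts of the model: one vertex buffer per color.
for(k = 0; k < g_NumberOfMeshes; k++){
  D3DMATERIAL9 mtrl;
  ZeroMemory(&mtrl, sizeof(mtrl));
  mtrl.Diffuse.r = mtrl.Ambient.r = g_color[k].r/255.0f;   // build a material
  mtrl.Diffuse.g = mtrl.Ambient.g = g_color[k].g/255.0f;   // from the stored
  mtrl.Diffuse.b = mtrl.Ambient.b = g_color[k].b/255.0f;   // color
  mtrl.Diffuse.a = mtrl.Ambient.a = 1.0f;
  g_pd3dDevice->SetMaterial(&mtrl);
  g_pd3dDevice->SetStreamSource(0, g_pMesh[k], 0, sizeof(CUSTOMVERTEX_0));
  g_pd3dDevice->SetFVF(D3DFVF_CUSTOMVERTEX_0);
  g_pd3dDevice->DrawPrimitive(D3DPT_TRIANGLELIST, 0, g_MeshVertexCount[k]/3);
}
// Now the textured parts: bind each texture and draw its vertex buffer.
for(k = 0; k < g_NumberOfTextures; k++){
  g_pd3dDevice->SetTexture(0, g_pMap[k]);
  g_pd3dDevice->SetStreamSource(0, g_pMesh[k+16], 0, sizeof(CUSTOMVERTEX_1));
  g_pd3dDevice->SetFVF(D3DFVF_CUSTOMVERTEX_1);
  g_pd3dDevice->DrawPrimitive(D3DPT_TRIANGLELIST, 0, g_MeshVertexCount[k+16]/3);
}
g_pd3dDevice->SetTexture(0, NULL);          // detach the last texture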
Bibliography
[1] Advanced Micro Devices, Inc. RenderMonkey Toolsuite. http://ati.amd.com/
developer/rendermonkey/, 2006.
[2] R. S. Ferguson. Practical Algorithms for 3D Computer Graphics. Natick, MA:
A K Peters, 2001.
[3] M. J. Kilgard. The OpenGL Utility Toolkit. http://www.opengl.org/
resources/libraries/glut/, 2006.
[4] Microsoft Corporation. DirectX SDK. http://msdn.microsoft.com/directx/
sdk/, 2006.
[5] NVIDIA Corporation. Cg Toolkit 1.5. http://developer.nvidia.com/object/
cg_toolkit.html, 2006.
[6] D. Shreiner, M. Woo, J. Neider and T. Davis. OpenGL Programming Guide,
Fifth Edition. Reading, MA: Addison-Wesley Professional, 2005.
[7] R. Wright and M. Sweet. OpenGL Superbible, Second Edition. Indianapolis,
IN: Waite Group Press, 1999.
14 High-Quality 3D with OpenGL 2
This chapter looks at how the programmable GPU delivers greater realism in real time, and how OpenGL's extra features allow application programs easy access to it. Specifically, we will show you how to write and use
programs for the GPU written in the OpenGL Shading Language (GLSL).1
The definitive reference source for OpenGL Version 2 is the official
guide [10], but it has few examples of shading-language programs. The original book on the shading language by Rost [9] explains how it works and
gives numerous examples which illustrate its power. To get the most out of
OpenGL, one should have both these books in one's library.
Of course, it is possible to use the Direct3D API to achieve similar
results, but our opinion is that OpenGL 2 is a lot simpler to use and apply. Most importantly, it is completely compatible with stereoscopic rendering and cross-platform development, which is obviously very important
in VR.
14.1 OpenGL Version 2
Turning back to Figures 13.2 and 13.3 in Chapter 13, you can see how the
special graphics processing chips can turn a data stream consisting of 3D
vertex and pixel information into rendered images within the frame buffer.
In OpenGL, this is termed the fixed functionality. The fixed-functionality
pipeline is also a bit restrictive because, for example, some application programs would like to make use of more realistic lighting models such as BRDF
and subsurface reflection models that are so important in rendering human
skin tones. Or, for that matter, more sophisticated shading algorithms, such
as Phong shading. The list goes on. With such a long shopping list of complex algorithms vying to be in the next generation of GPUs, doing it all with
a fixed-functionality processor was always going to be a challenge.
After some attempts to extend the hardware of the fixed functionality, all
the graphics chip manufacturers realized that the only solution was to move
away from a GPU based on fixed functionality and make the pipeline programmable. Now, one might be thinking that a programmable GPU would
have an architecture not too dissimilar to the conventional CPU. However,
the specialized nature of the graphics pipeline led to the concept of several dedicated programmable processors within the GPU, most notably the vertex and fragment processors shown in Figure 14.1.
1. The language is called a shading language because it tells the GPU how to shade the
surfaces of the polygonal models in the rendering pipeline. This can include arbitrarily complex descriptions of light absorption, diffusion, specularity, texture/image mapping, reflection,
refraction, surface displacement and transients.
Figure 14.1. The OpenGL processing pipeline with its programmable processors. Pro-
grams are written in a high-level language, compiled and loaded into their respective
processing units.
There are limits on the size of a program: the Realizm chip, for example, allows 1K instructions for a program in the vertex processor and 256K instructions in the
fragment processor.
14.1.1 How It Works
In Version 2 of OpenGL, a small number of extra functions have been included to enable an application program to load shader program code into
the GPU's processors and to switch the operation of the GPU from its fixed-functionality behavior to a mode where the actions carried out on vertices and pixel fragments are dictated by these shader programs. The shader
program code is written in a language modeled on C and called the OpenGL
Shading Language (GLSL) [3]. Like C, GLSL programs have to be compiled for their target processor, in this case the GPU. Different GPUs from
different manufacturers such as NVIDIA, 3Dlabs and ATI will require different machine code, and so the GLSL compiler needs to be implemented
as part of the graphics device driver which is supplied with a specific GPU.
After compilation, the device driver forwards the machine instructions to the
hardware. Contrast this with Direct3D where its shader language is compiled
by Direct3D, passed back to the application and then sent to the rendering
hardware.
The GLSL syntax is rich, and we do not have space to go into it in any
detail here. Rost et al. [3, 9] provide a comprehensive language definition
and many examples. However, it should not be difficult to follow the logical
statements in a GLSL code listing for anyone who is familiar with C. Nevertheless, it is important to appreciate the special features and restrictions which
arise because of the close relationship with the language of 3D graphics and
the need to prevent shader code from doing anything that does not match
well with the GPU hardware; for example, trying to access vertex data in a
fragment shader. This is because at the fragment processing stage in the rendering pipeline, the vertex data has been interpolated by rasterization and the
fragment does not know from which polygon it came. So it is not possible to determine whether the fragment came from a triangular polygon or a quadrilateral polygon, or what the coordinates of the polygon's vertices were. We
shall now focus on what we believe to be the four key concepts in OpenGL
Version 2:
1. The functions to compile, load and interact with the shader language
programs.
2. The GLSL program data-type qualifiers: const, attribute, uniform
and varying.
3. Built-in variables: these allow the shader programs to access and alter the vertex and fragment data as it is passed through the processors.
4. Samplers: these unique data types allow fragment shaders access to
OpenGL's texture maps.
Perhaps the most important concept in shader programming is this: the outputs from the vertex shader, which include some of the built-in and all the user-defined varying variables that have been calculated in the vertex shader or set on a per-vertex basis, are interpolated across a polygon and passed to the fragment shader as a single value specifically for that fragment (or pixel) alone.
14.1.2
is time to use them, they are identified by their handle, e.g., the variable
progList.
GLuint vertList, fragList, progList;    // handles for shader codes and program
const char *VertexSource   = " ....";   // GLSL source text for the vertex shader
const char *FragmentSource = " ....";   // GLSL source text for the fragment shader

vertList = glCreateShader(GL_VERTEX_SHADER);
fragList = glCreateShader(GL_FRAGMENT_SHADER);
glShaderSource(vertList, 1, &VertexSource, NULL);
glShaderSource(fragList, 1, &FragmentSource, NULL);
glCompileShader(vertList);
glCompileShader(fragList);
progList = glCreateProgram();
glAttachShader(progList, vertList);
glAttachShader(progList, fragList);
glLinkProgram(progList);
Listing 14.1. Installing a vertex and fragment shader pair. The OpenGL functions set
an internal error code and this should be checked.
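The caption's advice can be made concrete. As a hedged sketch (not part of the original listing), the standard Version 2 query functions can be used to test the compile and link status and to print the information log when something goes wrong:

GLint ok;
glGetShaderiv(vertList, GL_COMPILE_STATUS, &ok);     // did the vertex shader compile?
if(!ok){
    char log[1024];
    glGetShaderInfoLog(vertList, sizeof(log), NULL, log);
    fprintf(stderr, "Vertex shader error:\n%s\n", log);
}
glGetProgramiv(progList, GL_LINK_STATUS, &ok);       // did the program link?
if(!ok){
    char log[1024];
    glGetProgramInfoLog(progList, sizeof(log), NULL, log);
    fprintf(stderr, "Program link error:\n%s\n", log);
}

The same check would normally be repeated for the fragment shader handle.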
14.1.3
The OpenGL Shading Language supports the usual C-style data types of float, int etc. To these it adds various types of vector and matrix declarations such as vec3 (a 3 × 1 vector), vec2 and mat4 (a 4 × 4 array), for
example. Vectors and matrices are particularly pertinent for 3D work. Whilst
these additional data types occur frequently in shader programs, perhaps the most interesting feature of GLSL variables is their qualifiers, e.g., varying.
These are important because when applied to global variables (declared outside the main or other functions), they provide a mechanism for parameters
to be passed to the shader code from the application. They also provide a
mechanism for the vertex shader to pass information to the fragment shader
and to interact with the built-in variables which provide such things as the
transformation matrices calculated in the fixed-functionality pipeline. A good
understanding of the meaning and actions of the qualifiers is essential if one
wants to write almost any kind of shader. We will now consider a brief outline
of each:
const. As the name implies, these are constants and must be initialized when declared, e.g., const vec3 Xaxis = vec3(1.0, 0.0, 0.0);.
attribute. This qualifier is used for variables that are passed to
the vertex shader from the application program on a per-vertex basis. For example, in the shader they are declared in the statement
attribute float scale;. A C/C++ application program using a
shader with this global variable would set it in a call to the function
glVertexAttrib1f(glGetAttribLocation(Prog,"scale"),2.0).
14.1.4
In both the vertex and fragment shader, the ability to read from and write to
predefined (called built-in) qualified variables provides the interface to the
fixed-functionality pipeline and constitutes the shader's input and output. For example, attribute vec4 gl_Vertex; gives the homogeneous coordinates of the vertex being processed. The built-in variable vec4 gl_Position;
must be assigned a value at the end of the vertex shader so as to pass the
vertex coordinates into the rest of the pipeline. A vertex shader must also
apply any transformations that are in operation, but this can be done by a
built-in shader function: ftransform();. Thus a minimal vertex shader can
consist of the two lines:
void main(void){
// apply the fixed functionality transform to the vertex
gl_Position=ftransform();
// copy the vertex color to the front side color of the fragment
gl_FrontColor=gl_Color;
}
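For completeness, a correspondingly minimal fragment shader (our own two-line sketch, not shown in the text) would simply copy the interpolated color to the output fragment:

void main(void){
  // write the interpolated front color straight to the output fragment
  gl_FragColor = gl_Color;
}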
Qualifier   Type    Variable                    R/W    Information
attribute   vec4    gl_Vertex                   Read   Input vertex coordinates
attribute   vec4    gl_Color                    Read   Input vertex color
attribute   vec4    gl_Normal                   Read   Input vertex normal
attribute   vec4    gl_MultiTexCoord0           Read   Vertex texture coords
const       int     gl_MaxLights                Read   Maximum number of lights permitted
const       int     gl_MaxTextureCoords         Read   Maximum number of texture coordinates
uniform     mat4    gl_ModelViewMatrix          Read   The model position matrix
uniform     mat4    gl_ProjectionMatrix         Read   The projection matrix
uniform     struct  gl_LightSourceParameters    Read   Light source settings
            vec4    gl_Position                 Write  Must be written by all vertex shaders
varying     vec4    gl_FrontColor               Write  Front-facing color for fragment shader
varying     vec4    gl_BackColor                Write  Back color to pass to fragment shader
varying     vec4    gl_TexCoord[]               Write  Texture coordinates at vertex

Table 14.1. A subset of a vertex shader's input and output built-in variables.
Qualifier   Type    Variable               R/W    Information
            vec4    gl_FragCoord           Read   Window-relative coordinates of fragment
            bool    gl_FrontFacing         Read   TRUE if fragment is part of a front-facing primitive
uniform     mat4    gl_ModelViewMatrix     Read   The model position matrix
uniform     mat4    gl_ProjectionMatrix    Read   The projection matrix
varying     vec4    gl_Color               Read   Fragment color from vertex shader
varying     vec4    gl_TexCoord[]          Read   Texture coordinates in fragment
            vec4    gl_FragColor           Write  Output of fragment color from shader
            float   gl_FragDepth           Write  Z depth, defaults to fixed-function depth

Table 14.2. A subset of a fragment shader's input and output built-in variables.
Note that the = assignment operator is copying all the elements of a vector from source to destination. Tables 14.1 and 14.2 list the built-in variables
that we have found to be the most useful. A full list and a much more detailed
description of each can be found in [9] and [3].
OpenGL Version: 1.5.4454 Win2000 Release - Vendor : ATI Technologies Inc.
OpenGL Renderer: RADEON 9600 XT x86/MMX/3DNow!/SSE
No of aux buffers 0 - GL Extensions Supported:
GL_ARB_fragment_program GL_ARB_fragment_shader
...
GL_ARB_vertex_program
GL_ARB_vertex_shader
...
OpenGL Version: 2.0.0 - Vendor : NVIDIA Corporation
OpenGL Renderer: Quadro FX 1100/AGP/SSE/3DNOW!
No of aux buffers 4 - GL Extensions Supported:
GL_ARB_fragment_program GL_ARB_fragment_shader
...
GL_ARB_vertex_program
GL_ARB_vertex_shader
...
Figure 14.2. Output from the GLtest program, which interrogates the display hardware to reveal the level of feature support. Extensions that begin with GL_ARB indicate that the feature is supported by the OpenGL Architecture Review Board and are likely to be included in all products. Those which begin with something like GL_NV are specific to a particular GPU manufacturer.
14.1.5
Upgrading to Version 2
#include <stdio.h>
#include <gl\glu.h>
#include <gl\glaux.h>

void main(int argc, char **argv){
  const char *p;
  GLint naux;
  auxInitDisplayMode(AUX_SINGLE | AUX_RGBA);
  auxInitPosition(0,0,400,400);
  auxInitWindow(argv[0]);
  printf("OpenGL Version: %s\n", glGetString(GL_VERSION));
  printf("OpenGL Vendor : %s\n", glGetString(GL_VENDOR));
  printf("OpenGL Renderer: %s\n", glGetString(GL_RENDERER));
  printf("\nGL Extensions Supported:\n\n");
  p = (const char *)glGetString(GL_EXTENSIONS);
  // extension names are separated by spaces - print one per line
  while(*p != 0){ putchar(*p); if(*p == ' ')printf("\n"); p++; }
  glGetIntegerv(GL_AUX_BUFFERS,&naux);
  printf("No of auxiliary buffers %d\n",naux);
  glFlush();
}
Listing 14.2. List the OpenGL extensions supported by the graphics hardware.
14.1.6
There are two alternatives: use the GLUT (see Section 13.2) or obtain the
header file glext.h (available from many online sources), which contains
definitions for all the OpenGL constants and function prototypes for all
functions added to OpenGL since Version 1. Make sure you use the Version 2 glext.h file. Of course, providing prototypes for functions such as
glCreateProgram() does not mean that when an application program is
linked with the OpenGL32.lib Windows stub library for opengl32.dll,
the function code will be found. It is likely that a missing function error will
occur. This is where an OpenGL function that is unique to Windows comes
in. Called wglGetProcAddress(...), its argument is the name of a function whose code we need to locate in the driver. (Remember that most of the
OpenGL functions are actually implemented in the hardware driver and not
in the Windows DLL.)
Thus, for example, if we need to use the function glCreateProgram(),
we could obtain its address in the driver with the following C code:
typedef GLuint (APIENTRY * PFNGLCREATEPROGRAMPROC)(void);
PFNGLCREATEPROGRAMPROC glCreateProgram = NULL;

// the argument to wglGetProcAddress() is the function's name as a string
glCreateProgram = (PFNGLCREATEPROGRAMPROC) wglGetProcAddress("glCreateProgram");
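The same pattern is repeated for every Version 2 entry point the application needs. As a hedged illustration (our own sketch, not from the original text), a second entry point can be loaded and the return value tested, since wglGetProcAddress() returns NULL when the driver does not export the function:

typedef void (APIENTRY * PFNGLCOMPILESHADERPROC)(GLuint shader);
PFNGLCOMPILESHADERPROC glCompileShader =
    (PFNGLCOMPILESHADERPROC) wglGetProcAddress("glCompileShader");
if(glCompileShader == NULL){
    // the installed driver does not support the OpenGL 2 shader functions
    MessageBox(NULL, "OpenGL 2 functions not available", "Error", MB_OK);
}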
14.2
14.2.1
Listing 14.3. GPU programmability can be added to an existing application by providing most of the functionality in a single file. All the shaders can be compiled
and executed by using a few functions. This approach minimizes use of the Version 2 extensions which need to be called directly from the application's rendering routine.
3
For example, glUniform3f(..) sets a three-element vector, glUniform1i(..)
an integer and so on.
  else if (i == ...){                   // set some of the "uniform" variables for shader "i"
                                        // (ones that don't need to change)
    glUniform3f(getUniLoc(progList[i], "LightColor"), 0.0, 10.0, 4.0);
    glUniform1f(getUniLoc(progList[i], "Diffuse"), 0.45);
    // and make any other settings
    .. //
  }
  // GO TO NEXT SHADER
}

void ShadersClose(void){
  int i;
  for(i=0;i<NPROGRAMS;i++){
    glDeleteShader(vertList[i]); glDeleteShader(fragList[i]);
    glDeleteProgram(progList[i]);
  }
}

void UseShaderProgram(long id){         // choose which shader program to run
  glUseProgram(id);                     // tell OpenGL
  if(id == 1){                          // set any parameters for the shader
    glUniform3f(getUniLoc(progList[id], "BrickColor"), 1.0, 0.3, 0.2);
  }
  else if ( ...) {                      // do it for the others. It may be necessary
    ..                                  // to set some of the attributes in the
  }                                     // rendering code itself.
}
..//

extern GLuint progList[];

void RenderScene(void){
  .. //
  UseShaderProgram(3);                  // Phong shading program
  glEnable(GL_COLOR_MATERIAL);
  glBegin(GL_TRIANGLES);
  .. //
  fp=MainFp; for(i=0;i<Nface;i++,fp++){ // render the polygon
    ..//
    for(k=0;k<3;k++){ glMaterial*(..); glVertex(...); glNormal(...); }
  }
  glEnd();
  UseShaderProgram(0);                  // switch back to fixed functionality
  // etc.
Listing 14.4. Changing the rendering code to accommodate GPU shaders. The code
fragments that appear here show the modifications that need to be made in Listings 13.11, 13.12 and 13.14 in Chapter 13.
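The listings above call a small helper, getUniLoc(), whose definition is not reproduced in this chapter. A minimal sketch of what such a helper might do (our own version, assuming <stdio.h> is available) is a thin wrapper around glGetUniformLocation():

// Return the location of uniform "name" in shader program "prog".
// glGetUniformLocation() returns -1 if the name does not exist or was
// optimized away by the GLSL compiler.
GLint getUniLoc(GLuint prog, const char *name){
    GLint loc = glGetUniformLocation(prog, name);
    if(loc == -1)
        fprintf(stderr, "Uniform \"%s\" not found in program %u\n", name, prog);
    return loc;
}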
There are a few comments we should make about the changes to the
rendering code that become necessary to demonstrate the use of the example
shaders we will explore shortly.
We will not use display lists in the examples. Instead, the WM_PAINT
message which calls Draw3dScene(..) (in Listing 13.11) will call
RenderScene() as shown in Listing 14.4.
The polygons are sorted so that all those that use the same shader program are processed together.
Individual image maps are loaded into texture memory at the same
time as the mesh is loaded. Each image map is bound to a different
OpenGL texture identifier:
glActiveTexture(GL_TEXTURE0+textureID);
glBindTexture(GL_TEXTURE_2D,textureID+1);
Thus, when the shader program that carries out the texturing is executing, for each image map, a uniform variable (for example "RsfEnv") is
set with a matching texture identifier, such as GL_TEXTURE0+1+k.
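As a hedged aside (the uniform name ImageMap and the indices i and k are our own assumptions), the usual way to tie a sampler uniform to a texture is to bind the texture to a unit and then pass the unit index to the sampler:

// Hedged sketch: bind an image map to texture unit k and point a sampler
// uniform (here assumed to be called "ImageMap") at that unit.
glActiveTexture(GL_TEXTURE0 + k);                   // select texture unit k
glBindTexture(GL_TEXTURE_2D, textureID + 1);        // bind the OpenGL texture object
glUniform1i(getUniLoc(progList[i], "ImageMap"), k); // a sampler receives the unit index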
14.3
When vertex and fragment shader programs are installed, the fixed-functionality behavior for vertex and fragment processing is disabled. Neither ModelView nor ModelViewProjection transformations are applied to
vertices. Therefore, at the very least, the shader programs must apply these
transformations.
14.3.1
It is possible to provide a complete emulation of the fixed-functionality behavior of the rendering pipeline in a pair of programmable shaders. The code
required to do this is quite substantial, because transformations, mapping,
lighting, fog etc. all have to be accounted for. Rost [9] provides a comprehensive example of fixed-functionality emulation. In this section, the examples
will highlight how to implement a few of the most visually noticeable effects;
that is, Phong shading and bump mapping.
14.3.2
Listing 14.5. A shader program pair that emulates very basic Gouraud shading for a single light source and plain colored material. The OpenGL eye coordinate system places the origin at (0, 0, 0) with a direction of view of (0, 0, −1) and an up direction of (0, 1, 0). OpenGL's y-axis is vertical.
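The body of Listing 14.5 does not appear on this page. Purely as an indicative sketch of the kind of program pair the caption describes (our own code, not the book's listing), per-vertex Gouraud lighting might be written as:

// Vertex Shader - basic Gouraud (per-vertex) lighting, single light, plain color
void main(void){
  vec3 P = vec3(gl_ModelViewMatrix * gl_Vertex);        // eye-space position
  vec3 N = normalize(gl_NormalMatrix * gl_Normal);      // eye-space normal
  vec3 L = normalize(vec3(gl_LightSource[0].position) - P);
  float diffuse = max(dot(N, L), 0.0);
  gl_FrontColor = gl_Color * diffuse;                   // lighting evaluated per vertex
  gl_Position   = ftransform();                         // required
}

// Fragment Shader - simply output the interpolated color
void main(void){
  gl_FragColor = gl_Color;
}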
// Vertex Shader - Phong shading, single light source
varying vec3 PP;    // vertex position in eye coordinates
varying vec3 PN;    // vertex normal in eye coordinates

void main(void){
  // copy the vertex position and compute the transformed normal
  PP = vec3(gl_ModelViewMatrix * gl_Vertex);
  PN = gl_NormalMatrix * gl_Normal;
  gl_Position   = gl_ModelViewProjectionMatrix * gl_Vertex; // required
  gl_FrontColor = gl_Color;                                 // required
}

// Fragment Shader - Phong shading, single light source
varying vec3 PP;
varying vec3 PN;
uniform vec4 lc;    // light color (assumed to be set by the application)

void main(void){
  // compute a vector from the surface fragment to the light position
  vec3 lightVec   = normalize(vec3(gl_LightSource[0].position) - PP);
  // must re-normalize interpolated vector to get Phong shading
  vec3 vnorm      = normalize(PN);
  // compute the reflection vector
  vec3 reflectVec = reflect(-lightVec, vnorm);
  // compute a unit vector in direction of viewing position
  vec3 viewVec    = normalize(-PP);
  // calculate amount of diffuse light based on normal and light angle
  float diffuse   = max(dot(lightVec, vnorm), 0.0);
  float spec      = 0.0;
  // if there is diffuse lighting, calculate specular
  if(diffuse > 0.0){
    spec = max(dot(reflectVec, viewVec), 0.0);
    spec = pow(spec, 32.0);
  }
  // add up the light sources - write into the output built-in variable
  gl_FragColor = gl_Color * diffuse + lc * spec;
}
Listing 14.6. The Phong shading program pair for a single light source.
Figure 14.3.
14.3.3
Image Mapping
OpenGL's fixed-functionality texture/image mapping is excellent, and the extensions introduced in minor Versions 1.1 through 1.5, which allowed several maps to be blended together, are sufficient for most needs. Nevertheless, the shading language (GLSL) must facilitate all forms of image mapping so that it can be used (for example, with Phong's shading/lighting model) in all programs. In GLSL, information from texture maps is acquired with the use of a sampler. Rost [9] provides a nice example of image map blending with
// Vertex Shader - image mapping
varying float Diffuse;   // diffuse light level passed to the fragment shader
varying vec2  TexCoord;  // texture coordinates passed to the fragment shader

void main(void){
  // same as Gouraud shaded: eye-space position
  vec3 PP = vec3(gl_ModelViewMatrix * gl_Vertex);
  // normal vector
  vec3 PN = normalize(gl_NormalMatrix * gl_Normal);
  // light direction vector
  vec3 LV = normalize(vec3(gl_LightSource[0].position) - PP);
  // diffuse illumination
  Diffuse = max(dot(LV, PN), 0.0);
  // Get the texture coordinates - choose the first set (we only assign one set anyway)
  // and pass them to the fragment shader
  TexCoord = gl_MultiTexCoord0.st;
  gl_Position = ftransform();   // required!
}
three maps. We will keep it simple in our example and show how a shader
program pair can paint different maps on different fragments, which is a requirement for our small mesh model viewer application. The code for these
// Vertex Shader - mapping
varying vec3 Normal;
varying vec3 EyeDir;

void main(void) {
  vec4 pos    = gl_ModelViewMatrix * gl_Vertex;
  Normal      = normalize(gl_NormalMatrix * gl_Normal);
  // use swizzle to get xyz (just one of many ways!)
  EyeDir      = pos.xyz;
  gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex; // required
}

// Fragment Shader - mapping
varying vec3 Normal;
varying vec3 EyeDir;
shader programs is given in Listing 14.7. These are the GLSL programs used
in conjunction with the mapping part of the code in Listing 14.4.
14.3.4
Reflection Mapping
The mesh model viewer program must also be able to simulate reflection by
using a mapping technique. Fixed functionality provided an easy way to do
this by automatically generating texture coordinates based on an environment
map (as was seen in Section 7.6.2). To write a GLSL program to simulate
reflection/environment mapping, our code will also have to do that. Listing 14.8 shows the outline of a suitable shader pair. For reflection maps, we
do not want to use any surface lighting models, and so this is omitted.
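Listing 14.8 itself is not reproduced here. As a hedged illustration of the idea (assuming the reflection environment is held in a cube map bound to a samplerCube uniform we call EnvMap), the fragment side might look like:

// Fragment Shader - reflection mapping (illustrative sketch)
varying vec3 Normal;
varying vec3 EyeDir;
uniform samplerCube EnvMap;   // assumed: cube map holding the environment

void main(void){
  // reflect the view direction about the surface normal ...
  vec3 r = reflect(normalize(EyeDir), normalize(Normal));
  // ... and look the reflected direction up in the environment map;
  // no surface lighting model is applied, as discussed in the text
  gl_FragColor = textureCube(EnvMap, r);
}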
14.3.5
Bump Mapping
Our portfolio of shader programs would not be complete without the ability
to use an image texture as a source of displacement for the surface normal, i.e.,
bump mapping. Bump mapping requires us to use a sampler2D and provide
a tangent surface vector. The same tangent surface vector that we will use
when discussing procedural textures (see Section 14.3.6) will do the job. You
can see how this is generated from the image map parameters if you look at
the application program code for which Listing 14.4 is an annotated outline.
Vertex and fragment shaders for bump mapping are given in Listing 14.9. A
few comments are pertinent to the bump-mapping procedure:
To minimize the work of the fragment shader, the vertex shader calculates vectors for light direction and eye (viewpoint) direction in a coordinate frame of reference that is in the plane of the polygon which will
be rendered by the fragment shader (this is called surface-local space).
Figure 14.4 illustrates the principle. In the world (global) coordinate
frame of reference, the eye is at the origin. The polygon's normal n,
tangent t and bi-normal b are calculated in the global frame of reference. Transformation of all vectors to a surface-local reference frame
allows the fragment shader to perform position-independent lighting
calculations because the surface normal is always given by the vector
(0, 0, 1).
By doing this, we ensure that the surface normal vector is always given
by coordinates (0, 0, 1). As such, it is easily displaced (or bumped)
in the fragment shader by adding to it the small changes Δu and Δv
obtained from the bump map.
To obtain the surface-local coordinates, we first set up a rectilinear coordinate system with three basis vectors. We have two of them already: the surface normal n = (nx, ny, nz) and the tangent vector t = (tx, ty, tz) (passed to the vertex shader as a vector attribute from the application program). The third vector b = (bx, by, bz) (known as the bi-normal) is defined as b = n × t. Vectors dw (such as viewpoint and light directions) are transformed from world coordinates to the surface-local frame by ds = [Tws] dw, where [Tws] is the 3 × 3 transformation matrix

$$[T_{ws}] = \begin{bmatrix} t_x & t_y & t_z \\ b_x & b_y & b_z \\ n_x & n_y & n_z \end{bmatrix}. \qquad (14.1)$$
// Vertex Shader - Bump mapping
varying vec3 LightDir;   // light direction in surface-local space
varying vec3 EyeDir;     // eye direction in surface-local space
varying vec2 TexCoord;   // texture coordinates for the bump map
attribute vec3 Tangent;  // surface tangent supplied per vertex by the application

void main(void) {
  // do the usual stuff
  EyeDir        = vec3(gl_ModelViewMatrix * gl_Vertex);
  gl_Position   = ftransform();
  gl_FrontColor = gl_Color;
  // pass on texture coords to fragment shader
  TexCoord = gl_MultiTexCoord0.st;
  // Get normal, tangent and third vector to form a local coordinate frame
  // of reference that lies in the plane of the polygon for which we are
  // processing one of its vertices.
  vec3 n = normalize(gl_NormalMatrix * gl_Normal);
  vec3 t = normalize(gl_NormalMatrix * Tangent);
  vec3 b = cross(n, t);
  // get the light direction and viewpoint (eye) direction in this new
  // coordinate frame of reference.
  vec3 v;
  vec3 LightPosition = vec3(gl_LightSource[0].position);
  v.x = dot(LightPosition, t);
  v.y = dot(LightPosition, b);
  v.z = dot(LightPosition, n);
  LightDir = normalize(v);
  v.x = dot(EyeDir, t);
  v.y = dot(EyeDir, b);
  v.z = dot(EyeDir, n);
  EyeDir = normalize(v);
}
// Fragment Shader - Bump mapping
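The fragment half of Listing 14.9 does not appear here. A hedged sketch of a fragment shader working in the surface-local frame just described (the bump-map layout and the uniform name BumpMap are our own assumptions) is:

// Fragment Shader - bump mapping (illustrative sketch)
varying vec3 LightDir;        // light direction in surface-local space
varying vec3 EyeDir;          // eye direction in surface-local space (unused here)
varying vec2 TexCoord;
uniform sampler2D BumpMap;    // assumed: du,dv stored in the red/green channels

void main(void){
  // small normal displacements, remapped from [0,1] to [-1,1]
  vec2 d = texture2D(BumpMap, TexCoord).rg * 2.0 - 1.0;
  // in surface-local space the undisturbed normal is (0,0,1); bump it
  vec3 n = normalize(vec3(d.x, d.y, 1.0));
  float diffuse = max(dot(normalize(LightDir), n), 0.0);
  gl_FragColor  = gl_Color * diffuse;
}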
14.3.6
Procedural Textures
It is in the area of procedural textures that the real power of the programmable
GPU becomes evident. More or less any texture that can be described algorithmically can be programmed into the GPU and rendered much faster than
with any software renderer. This is an incredible step forward!
// Vertex Shader - procedural bumps
varying vec3 LightDir;   // vectors to be passed
varying vec3 EyeDir;     // to fragment shader
attribute vec3 Tangent;  // surface tangent supplied per vertex by the application

void main(void) {
  // usual stuff - feed input into the OpenGL machine
  gl_Position    = ftransform();
  gl_FrontColor  = gl_Color;
  gl_TexCoord[0] = gl_MultiTexCoord0;
  // get a coordinate system based in plane of polygon
  vec3 n = normalize(gl_NormalMatrix * gl_Normal);
  vec3 t = normalize(gl_NormalMatrix * Tangent);
  vec3 b = cross(n, t);
  // Calculate the direction of the light and direction of
  // the viewpoint from this VERTEX using a coord system
  // based at this vertex and in the plane of its polygon.
  EyeDir = vec3(gl_ModelViewMatrix * gl_Vertex);
  vec3 LightPosition = vec3(gl_LightSource[0].position);
  vec3 v;
Listing 14.10. Procedural bump mapping, illustrating how to generate little bumps or
// Fragment Shader - procedural bumps
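Again, the fragment shader of Listing 14.10 is not reproduced. One illustrative way (our own sketch, with assumed uniforms BumpDensity and BumpSize) of creating a regular array of little bumps procedurally is:

// Fragment Shader - procedural bumps (illustrative sketch)
varying vec3 LightDir;       // light direction in surface-local space
varying vec3 EyeDir;         // eye direction in surface-local space (unused here)
uniform float BumpDensity;   // assumed: number of bump cells per texture unit
uniform float BumpSize;      // assumed: squared radius of each bump

void main(void){
  // position within the current bump cell, centred on (0,0)
  vec2 c = fract(gl_TexCoord[0].st * BumpDensity) - 0.5;
  float d = dot(c, c);
  // inside a bump, tilt the surface-local normal away from (0,0,1)
  vec3 n = (d < BumpSize) ? normalize(vec3(c, 1.0)) : vec3(0.0, 0.0, 1.0);
  float diffuse = max(dot(normalize(LightDir), n), 0.0);
  gl_FragColor  = gl_Color * diffuse;
}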
14.4
14.5
Summary
This chapter has introduced the concept of the programmable GPU. It showed
how OpenGL Version 2's C-like high-level shading language can be put into
practice to enhance the realistic appearance of objects and scenes whilst still
maintaining the real-time refresh rates so vital for VR applications.
14.5.1
Afterthought
Bibliography
[1] Advanced Micro Devices, Inc. RenderMonkey Toolsuite. http://ati.amd.com/
developer/rendermonkey/, 2006.
[2] R. S. Ferguson. Practical Algorithms for 3D Computer Graphics. Natick, MA: A
K Peters, 2001.
[3] J. Kessenich et al. The OpenGL Shading Language, Language Version 1.10,
Document Revision 59. http://oss.sgi.com/projects/ogl-sample/registry/ARB/
GLSLangSpec.Full.1.10.59.pdf, 2004.
[4] NVIDIA Corporation. Cg Toolkit 1.5. http://developer.nvidia.com/object/
cg_toolkit.html, 2006.
[5] M. Olano et al. Real-Time Shading. Natick, MA: A K Peters, 2002.
[6] M. Olano et al. Real-Time Shading. SIGGRAPH 2004 Course Notes, Course
#1, New York: ACM Press, 2004.
[7] K. Perlin. An Image Synthesizer. Computer Graphics 19:3 (1985) 287-296.
[8] S. Raghavachary. Rendering for Beginners: Image Synthesis using RenderMan.
Boston, MA: Focal Press, 2004.
[9] R. Rost. OpenGL Shading Language. Reading, MA: Addison-Wesley Professional, 2006.
[10] D. Shreiner, M. Woo, J. Neider and T. Davis. OpenGL Programming Guide,
Fifth Edition. Reading, MA: Addison-Wesley Professional, 2005.
15
Using Multimedia in VR
15.1
When you meet DirectShow for the first time, it can seem like an alien programming concept. The terms filter, signal and FilterGraph are not traditional programming ideas, and there are other complicating issues.
In DirectShow, an application program does not execute its task explicitly. Instead, it builds a conceptual structure. For example, the FilterGraph
contains components which are called the filters. The application then connects the filters to form a chain from a source to a sink. The route from
source to sink is essentially a signal path, and in the context of a multimedia application, the signal consists of either audio or video samples. Once
the FilterGraph is built, the application program instructs it to move samples
through the filters from source to sink. The application can tell the graph
to do many things to the data stream, including pausing the progress of the
samples through the filter chain. But just what is appropriate depends on the
nature of the source, sink and filters in between.
To make the idea of a FilterGraph and the signal processing analogy a
little more tangible, we will quote a few examples of what might be achieved
in an application program using DirectShow FilterGraphs:
The pictures from a video camera are shown on the computer's monitor.
A DVD movie is played on the computer's output monitor.
The output from a video camera is saved to a multimedia file on disk.
A movie or DVD is decoded and stored on the computer's hard drive using a high compression scheme.
A series of bitmap pictures is combined to form a movie file.
Frames from a movie are saved to bitmap image files on disk.
An audio track is added to or removed from a movie file.
Figure 15.1 shows diagrammatically how some of these FilterGraphs might
look. Indeed, the DirectShow SDK includes a helpful little application program called GraphEdit which allows you to interactively build and test FilterGraphs for all kinds of multimedia applications. The user interface for
GraphEdit is illustrated in Figure 15.2. The four FilterGraphs in Figure 15.1 perform four different tasks. In Figure 15.1(a), the output of a digital video
recorder is compressed and written to an AVI file. The signal is split so that
it can be previewed on the computers monitor. In Figure 15.1(b), a video
signal from a camera is mixed with a video from a file and written back to
disk. A preview signal is split off the data stream. The mixer unit can insert a
number of effects in real time. In Figure 15.1(c), a video stream from an AVI
file is combined with a live audio commentary. The signals are encoded to the
DVD standard, combined and written to DVD. In Figure 15.1(d), two video
sources from disk files are mixed and written back to file with simultaneous
preview.
Figure 15.2. The GraphEdit interactive tool for testing DirectShow FilterGraphs. It
allows the program's author to use any of the registered filters available on the computer, connect them together and experiment with their action.
15.1.1
Threading
A great deal of emphasis has been put on the importance of timing in DirectShow FilterGraphs. In addition, the FilterGraph seems almost to be an entity
working independently and only communicating with its driving application
from time to time. DirectShow is a multi-threaded technology. The FilterGraph executes in a separate thread of execution. Programming in a multi-threaded environment brings its own set of challenges. But to achieve the
real-time throughput demanded by multimedia applications, our programs
must embrace multi-threading concepts.
Nevertheless, this book is not focused on multi-threading or even about
writing multi-threaded code, so we will gloss over the important topics of
deadlocks, interlocks, synchronization, critical sections etc. In our example
programs, we will use only the idea of the critical section, in which blocks of
code that access global variables must not be executed at the same time by
different threads; say one thread reading from a variable and another thread
writing to it. We suggest that if you want to get a comprehensive explanation
of the whole minefield of multi-threaded programming in Windows, consult
one of the specialized texts [1] or [3].
There is one aspect of multi-threaded programming that we cannot avoid.
Applications must be linked with the multi-threaded C++ libraries.
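To make the critical-section idea concrete, the sketch below (illustrative only; the variable names are our own) shows the Win32 pattern we will rely on: both the thread that writes a shared buffer and the thread that reads it bracket their access with the same CRITICAL_SECTION object.

#include <windows.h>
#include <string.h>

CRITICAL_SECTION csBuffer;      // protects the shared buffer below
unsigned char *sharedBuffer;    // e.g., pixels written by one thread, read by another

void InitSharedData(void){
    InitializeCriticalSection(&csBuffer);
}

void WriterThreadWork(unsigned char *newData, int n){
    EnterCriticalSection(&csBuffer);     // blocks if the reader is inside
    memcpy(sharedBuffer, newData, n);    // safe to modify the shared data
    LeaveCriticalSection(&csBuffer);
}

void ReaderThreadWork(unsigned char *copy, int n){
    EnterCriticalSection(&csBuffer);
    memcpy(copy, sharedBuffer, n);       // safe to read the shared data
    LeaveCriticalSection(&csBuffer);
}

void FreeSharedData(void){
    DeleteCriticalSection(&csBuffer);
}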
15.2
Methodology
15.3
A Movie Player
As a first application, we will look at a very basic use of DirectShow: playing an AVI movie file. Its simplicity is the very reason why DirectShow can
be so useful. The FilterGraph we require for this application is shown in
Figure 15.4.
The application has an entry point, message loop and window message
handling function, as illustrated in Listing 15.1.
All the other DirectShow applications that we will examine in
this and subsequent chapters use an overall skeleton structure as
outlined in Listing 15.1. Each example will require some minor
modifications and additions to each of the functions listed. We
will not reprint this code in each program listing and only refer
to additions if they are very significant. Using our annotated
printed listings, it should be possible to gain an understanding
of the program logic and follow the execution path.
#include <windows.h>
#include <atlbase.h>
#include <dshow.h>
HWND ghApp=0;
HINSTANCE ghInst=0;
char gFileName[MAX_PATH];
void OpenClip(void){
HRESULT hr;
if (! GetClipFileName(gFileName))return;
hr = PlayMovieInWindow(gFileName);
if (FAILED(hr)) CloseClip();
}
void CloseClip(void){
HRESULT hr;
if(pMC) hr = pMC->Stop();
CloseInterfaces();
}
Listing 15.2. These functions build and execute the movie player graph.
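The body of PlayMovieInWindow() is not reproduced above. Purely as a hedged illustration of the DirectShow calls involved (the interface pointers pGB and pMC are assumed to be application globals, and error handling is reduced to a minimum), the core of such a function might look like this:

IGraphBuilder *pGB = NULL;
IMediaControl *pMC = NULL;

HRESULT PlayMovieInWindow(LPCSTR szFile){
    WCHAR wFile[MAX_PATH];
    MultiByteToWideChar(CP_ACP, 0, szFile, -1, wFile, MAX_PATH);
    // create the FilterGraph manager
    HRESULT hr = CoCreateInstance(CLSID_FilterGraph, NULL, CLSCTX_INPROC_SERVER,
                                  IID_IGraphBuilder, (void **)&pGB);
    if(FAILED(hr)) return hr;
    // let DirectShow build the whole source -> decoder -> renderer chain
    hr = pGB->RenderFile(wFile, NULL);
    if(FAILED(hr)) return hr;
    pGB->QueryInterface(IID_IMediaControl, (void **)&pMC);
    return pMC->Run();                   // start playback
}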
Listing 15.3. Changes required to make the renderer filter play the movie within the
application's client area.
rewind the movie back to the start or load another one. To put this functionality into the application, the code in Listings 15.1 and 15.4 is modified by
adding some instructions, global variables and extra functions as follows.
Two extra interface pointers are required:
IVideoWindow  *pVW = NULL;
IMediaEventEx *pME = NULL;
In the WndMainProc() message handler, we intercept graph event messages and act upon a change of client area size so that the video window always
fits inside the client area of the container window. (We are really asking another thread, the one running the renderer filter, to make sure it draws inside
the application's window.) This is done with the following code fragment:
case WM_GRAPHNOTIFY:
HandleGraphEvent();
break;
case WM_SIZE:
if (hWnd == ghApp) MoveVideoWindow();
break;
We tell DirectShow to route all messages through our application's window with the pVW->NotifyOwnerMessage(...) method of the video window interface. Listing 15.3 shows other modifications we need to make to
render the output into the client area of the main window.
Despite the appearance that the video output is now in the main window's client area, it is still being drawn by a system renderer. In most circumstances, this is irrelevant, but if you try to capture a frame from the movie by reading pixels from the window's client area, all you will get is a blank rectangle. Similarly,
if you try and copy the full screen to the clipboard then the part
of the window where the movie is rendered will again appear
blank. The reason for this is that the DirectShow renderer bypasses the GDI drawing functions and communicates directly
with the graphics hardware so as to maximize performance.
15.3.1
As a final touch for this application, we want to show you how to make your movie player accept files that have been dragged and dropped into the window, from Windows Explorer, for example. Drag and drop is really standard stuff but can be hard to work out from the Windows API documentation.
#include <windows.h>
#include <tchar.h>
#include <shlobj.h>
#include <oleidl.h>
void CloseDragAndDrop(void){
if(pT)delete pT;
}
STDMETHODIMP CDropTarget::Drop(
IDataObject * pDataObject,
//Pointer to the interface for the source data
DWORD grfKeyState, //Current state of keyboard modifier keys
POINTL pt,
//Current cursor coordinates
DWORD * pdwEffect //Pointer to the effect of the drag-and-drop
// operation
){
FORMATETC ff;
ff.cfFormat=CF_HDROP;
ff.ptd=NULL;
ff.dwAspect=DVASPECT_CONTENT;
ff.lindex= -1;
ff.tymed=TYMED_HGLOBAL;
STGMEDIUM pSM;
pDataObject->GetData(&ff,&pSM);
HDROP hDrop = (HDROP)(pSM.hGlobal);
TCHAR name[512];
int i=DragQueryFile(hDrop,0xFFFFFFFF,name,512);
DragQueryFile(hDrop,0,name,512);
// now that we've got the name as a text string use it ///////////
OpenDraggedClip(name); // this is the application defined function
//////////////////////////////////////////////////////////////////
ReleaseStgMedium(&pSM);
return S_OK;
}
15.4
Video Sources
The other major use we will make of DirectShow concerns data acquisition.
Essentially, this means grabbing video or audio. In this section, we will set
up some example FilterGraphs to acquire data from live video sources. These
could be digital camcorders (DV) connected via the FireWire interface or
USB cameras ranging in complexity from a simple webcam using the USB 1
interface to a pair of miniature cameras on a head-mounted display device
multiplexed through a USB 2 hub. It could even be a digitized signal from an
analog video recorder or S-video output from a DVD player.
To DirectShow, all these devices appear basically the same. There are
subtle differences, but provided the camera/digitizer has a WDM (Windows Driver Model) device driver, DirectShow has a source filter which any application program can load into its FilterGraph and use to generate a stream of
video samples. We shall see in this section that using a live video source in a
OleInitialize(NULL);
SetUpDragAndDrop(hWnd);
CloseDragAndDrop();
OleUninitialize();
and drop. When a file is dropped into the application window, the function OpenDraggedClip() is called. Note also the change to OleInitialize() instead of CoInitialize().
lets an application program peek into the data stream and grab
samples, but it is also used to tell the upstream source filters that
they must deliver samples in a particular format.
As we build the code for a couple of examples using live video source filters, you will see that the differences from the previous movie player program
are quite small. The longest pieces of code associated with video sources relate
to finding and selecting the source we want. In the context of FilterGraphs
using video sources, the ICaptureGraphBuilder2 interface comes into its
own by helping to put in place any necessary intermediate filters using intelligent connect. Without the help of the ICaptureGraphBuilder2 interface,
our program would have to search the source and sink filters for suitable input and output pins and link them. Some of the project codes on the CD
and many examples from the DXSDK use this strategy, and this is one of the
main reasons why two DirectShow programs that do the same thing can look
so different.
15.4.1
For the most part (the WinMain entry, message handler function etc.), the
code for this short program is very similar to the movie player code, so we
will not list it here. Instead, we go straight to look at the differences in the
structure of the FilterGraph given in Listing 15.7. A couple of functions merit
a comment:
1. g pCapture->RenderStream(). This ICaptureGraphBuilder2
method carries out a multitude of duties. Using its first argument,
we select either a preview output or a rendered output. This gives an
application program the possibility of previewing the video while capturing to a file at the same time. The second argument tells the method
that we need video samples. The last three arguments (pSrcFilter, NULL, NULL) put the graph together. They specify:
(a) The source of the video data stream.
(b) An optional intermediate filter between source and destination; a
compression filter for example.
(c) An output filter. If this is NULL, the built-in renderer will be used.
If we are recording the video data into a file using a file writing filter (say
called pDestFilter), the last three arguments would be (pSrcFilter, NULL, pDestFilter).
Listing 15.7. Builds a FilterGraph that previews live video from a DV/USB camera. Note the call to FindCaptureDevice(), which finds the video source filter. Its code is given in Listing 15.8.
Listing 15.8. Enumerating and selecting the video capture device. In this case, we
take the first source found. Note the use of the ATL CComPtr pointer class, which
handles our obligation of releasing interfaces when we are finished with them.
code can usually be used repeatedly. In FindCaptureDevice() (see Listing 15.8), the strategy adopted is to enumerate all devices and pick the first
suitable one. This requires some additional interfaces and a system moniker.
The DirectX SDK contains a comprehensive discussion on device
enumeration.
Listing 15.8 may be extended to find a second video source for stereo
work (see Chapter 16) or make a list of all attached video devices so that the
user can select the one she wants.
15.4.2
Capturing Video
Recording the video signal into an AVI file requires only two minor additions
to the video preview example (see Listing 15.9), both of which require the use
of methods from the ICaptureGraphBuilder2 interface:
1. g_pCapture->SetOutputFileName(..). This method returns a
pointer to a filter that will send a video data stream to a number of
different file formats. In this example, we choose an AVI type.
2. g_pCapture->RenderStream(...). This method is used twice. The
first call is to render to an onscreen window linked to a preview pin on
the video source filter. A second call requests a PIN_CATEGORY_CAPTURE pin on the source filter and links it to the file writer filter directly. The last three arguments (pSrcFilter, NULL, g_pOutput) specify the
connection.
We have considered file sources and sinks, video sources and a variety of
display renderers. This only leaves one element of DirectShow to take a look
at in order to have a comprehensive set of components which we will be able
to put together in many different combinations. We now turn to the last
element in DirectShow of interest to us, the in-place filter.
15.5
In this section, we will look a bit closer at the key element in DirectShow that
makes all this possible, the filter, and how it fits into a FilterGraph to play its part in some additional utility programs. We will also make more general comments and give overview code listings, and defer details for specific
applications that will be covered in subsequent chapters.
IBaseFilter * g_pOutput=NULL;
HRESULT CaptureVideo(){
HRESULT hr;
IBaseFilter *pSrcFilter=NULL;
// Get DirectShow interfaces
hr = GetInterfaces();
// Attach the filter graph to the capture graph
hr = g_pCapture->SetFiltergraph(g_pGraph);
CA2W pszWW( MediaFile );
hr = g_pCapture->SetOutputFileName(
&MEDIASUBTYPE_Avi, // Specifies AVI for the target file.
pszWW,
&g_pOutput,
// Receives a pointer to the output filter
NULL);
// Use the system device enumerator and class enumerator to find
// a video capture/preview device, such as a desktop USB video camera.
hr = FindCaptureDevice(&pSrcFilter);
// Add Capture filter to our graph.
hr = g_pGraph->AddFilter(pSrcFilter, L"Video Capture");
// Render the preview pin on the video capture filter to a preview window
hr = g_pCapture->RenderStream(
&PIN_CATEGORY_PREVIEW,
&MEDIATYPE_Video,
pSrcFilter,
NULL, NULL);
// Render the capture pin into the output filter
hr = g_pCapture->RenderStream(
&PIN_CATEGORY_CAPTURE, // Capture pin
&MEDIATYPE_Video,
// Media type.
pSrcFilter,
// Capture filter.
NULL,
// Intermediate filter
g_pOutput);
// file sink filter.
// Now that the filter has been added to the graph and we have
// rendered its stream, we can release this reference to the filter.
pSrcFilter->Release();
hr = g_pMC->Run();
return S_OK;
}
Listing 15.9. Capturing from a live video source to a file requires only minor changes
There are two strategies one can adopt in writing a DirectShow filter:
1. Build the filter as a distinct and independent component. This is what
one might do if the filter has wide applicability or is to be made available commercially. A good example of this type of filter is a video
codec.
15.5.1
Video filters play a vital part in playing and compiling multimedia content.
A common use of an in-place filter is a compressor between camera and file
writer. If we wanted to use a filter of our own creation, one that has been
registered, an instance of it would be created and installed with the code in
Listing 15.10.
Since we are not going to register our filter as a Windows component, its
object is created in the C++ style with the keyword new. Listing 15.11 shows
how an in-place filter would be created and added to a FilterGraph in such
cases. Several examples of filters designed to be used in this way (both source
and renderer filters) will be used in Chapter 16 on stereopsis programming.
Listing 15.11. Adding an in-place filter when the filter is implemented within the
application alone. The class CImageProcessFilter will have been derived from
15.5.2
One application of an in-place filter that comes to mind is video overlay. For
example, you might want to mix some real-time 3D elements into a video
data stream. Other possibilities might be something as trivial as producing a
negative image or, indeed, any image-processing operation on a video stream.
The key to processing an image in an in-place filter is to get access to the pixel
data in each frame. This requires reading it from the upstream filter, changing
it and sending it on to the downstream filter. We won't give full details here
but instead outline a suitable class declaration and the key class methods that
do the work. Listing 15.12 shows the class declaration, and Listing 15.13
covers the overridden Transform(...) method that actually performs the
pixel processing task.
Like the other filters in DirectShow, when we are customizing one of
them, we can get by with some minor modifications to one of the examples
of the SDK. Most of the original class methods can usually be left alone to
get on with doing what they were put there to do. Looking at a few features
in the code, one can say:
To implement an image-processing effect, all the action is focused in
the Transform() method.
Listing 15.12. Class declaration for an in-place video stream processing function.
// Copy the input sample into the output sample - then transform the output
// sample in place.
HRESULT CIPProcess::Transform(IMediaSample *pIn, IMediaSample *pOut){
// Copy the properties across
CopySample(pIn, pOut);
// copy the sample from input to output
return LocalTransform(pOut); // do the transform
}
HRESULT CIPProcess::CopySample(IMediaSample *pSource, IMediaSample *pDest){
// Copy the sample data
BYTE *pSourceBuffer, *pDestBuffer;
long lSourceSize = pSource->GetActualDataLength();
pSource->GetPointer(&pSourceBuffer);
pDest->GetPointer(&pDestBuffer);
CopyMemory( (PVOID) pDestBuffer,(PVOID) pSourceBuffer,lSourceSize);
..// Other operations which we must do but for which we can
..// just use the default code from a template:
..// Copy the sample times
..// Copy the Sync point property
..// Copy the media type
..// Copy the preroll property
..// Copy the discontinuity property
..// Copy the actual data length
long lDataLength = pSource->GetActualDataLength();
pDest->SetActualDataLength(lDataLength);
return NOERROR;
}
HRESULT CIPProcess::LocalTransform(IMediaSample *pMediaSample){
BYTE *pData;
// Pointer to the actual image buffer
long lDataLen;
// Holds length of any given sample
int iPixel;
// Used to loop through the image pixels
RGBTRIPLE *prgb;
// Holds a pointer to the current pixel
AM_MEDIA_TYPE* pType = &m_pInput->CurrentMediaType();
VIDEOINFOHEADER *pvi = (VIDEOINFOHEADER *) pType->pbFormat;
pMediaSample->GetPointer(&pData); // pointer to RGB24 image data
lDataLen = pMediaSample->GetSize();
// Get the image properties from the BITMAPINFOHEADER
int cxImage = pvi->bmiHeader.biWidth;
int cyImage = pvi->bmiHeader.biHeight;
int numPixels = cxImage * cyImage;
// a trivial example - make a negative
prgb = (RGBTRIPLE*) pData;
for (iPixel=0; iPixel < numPixels; iPixel++, prgb++) {
prgb->rgbtRed = (BYTE) (255 - prgb->rgbtRed);
prgb->rgbtGreen = (BYTE) (255 - prgb->rgbtGreen );
prgb->rgbtBlue = (BYTE) (255 - prgb->rgbtBlue);
}
return S_OK;
}
Listing 15.13. The overridden Transform() method copies the input sample from
input pin to output pin and then modifies the pixel data in the output buffer.
The first thing the filter does is copy the video sample from the input
buffer to the output buffer.
The filter also copies all the other properties about the video sample.
The filter performs its changes on the video sample in the output
buffer.
The filter assumes that the video data is in RGB24 bit format. Any
graph using this filter must ensure (by using intelligent connect, for
example) that the samples are in this format.
The filter finds the dimensions of the video frame and gets a pointer
to the RGB24 data. It can then modify the pixel values in a manner of
its choosing. Listing 15.13 demonstrates a filter that makes the image
negative.
Consult the SDK example projects to see other examples of in-place
transform filters.
15.5.3
15.5.4
There are two ways to save frames from a video stream. The most conventional way is to insert the DirectShow grabber filter into the filter chain, use it
to request the samples be delivered in RGB24 format and write the sampled
bitmaps into individual files. This approach is discussed in [2]. Alternatively,
a renderer filter could save each frame to an image file at the same time as
they are previewed onscreen. We will look at an implementation of this second strategy in Chapter 16.
The wonderful flexibility of the FilterGraph approach to processing multimedia data streams means that anything we have suggested doing to a video
source could equally well be applied to a file source.
15.6
In this last section of the chapter, we shall investigate a topic to which we will
return in Chapter 18. In VR, placing live video into a synthetic environment
is fundamental to such things as spatial videoconferencing, in which a 3D
mesh model of a monitor could have its screen textured with the video feed
to give the illusion of a TV monitor. We will also see that by rendering a
movie into a texture applied to a deformable mesh, it is possible to distort a
projected image to correct for display on a non-flat surface, as for example
in the immersive system outlined in Chapter 4. The same technique can be
adapted to project images onto small shapes to give the illusion that the item
is sitting on the table in front of you.
Figure 15.5. Playing movies into textures. (a) Using a custom renderer filter at the
end of the FilterGraph chain that writes into a texture on a mesh which is rendered
as a 3D object using Direct3D or OpenGL. (b) Using an application that builds and
executes two FilterGraphs to play two movies at the same time using both OpenGL
and Direct3D.
We have looked (see Chapter 12) into the texturing and rendering of 3D
meshes in real time using both the OpenGL and Direct3D APIs. Either of
these powerful libraries has the potential to acquire one or more of its texture
image sources from the output of a DirectShow FilterGraph. Turning the
problem on its head, so to speak, what we can do is write custom rendering
DirectShow filters that render movies (or live video sources) into memory
buffers and use these memory buffers as the source image of texture maps
applied to 3D mesh models. To provide the illusion of video playing in the
3D scene, one simply arranges to render the scene continuously, say with a
Windows timer, so that it updates the textures from their sources every time
a frame is rendered.
Figure 15.5 shows how these filters might be arranged in a typical application and how we could use them in a couple of contrived examples to
illustrate the principles involved:
1. A program that plays two movies at the same time using an OpenGL
textured mesh in one window and a Direct3D textured mesh in the
other. To achieve the result requires running two FilterGraphs at the
same time. We will show how this is achieved by using two threads of
execution.
2. A program that plays the same movie into two windows. The same
texturing effect is achieved as in the first program, but this time only
one FilterGraph is required, with the output from our custom renderer
filter being simultaneously presented using the OpenGL and Direct3D
windows.
The appearance of both these programs is illustrated in Figure 15.6. The
logical structure of the program consists of:
A WinMain() function creates and displays a small application
window. Commands from the menu attached to the application window call two functions. One will render a movie using
OpenGL and the other will use Direct3D. Each function will:
1. Create and display a child window in which to render a 3D
image-mapped polygonal mesh using OpenGL/Direct3D.
2. Construct and run a DirectShow FilterGraph with a custom filter at the end of the chain to render the video frames
into the OpenGL/Direct3D textures.
Figure 15.6. Rendering video into a texture (twice). A small main window carries a
menu to control the display windows. Two child windows display the video (from
movie files) by playing them into a texture and rendering that texture onto mesh
models. The window on the left illustrates this being done using OpenGL and the
one on the right using Direct3D.
Commands from the menu also stop/start the movie, close the
child windows and delete the FilterGraphs.
We have already discussed how to implement the application framework, initialize OpenGL and Direct3D and render textured meshes in both systems. So
we go straight on to discuss the rendering filters and how they copy the video
signal into OpenGL/Direct3D mesh textures. Whilst the rendering filters
for OpenGL and Direct3D are nearly identical in concept, there is one subtle difference and one feature which applies to all 3D textures that we must
highlight:
DirectShow uses multi-threading; therefore, the filters in the chain execute in a separate thread under the control of the FilterGraph manager.
On the other hand, our visualization (the rendering of the textured
mesh) runs in the same thread as the application program; the windows are updated in response to WM_PAINT messages. We must find a
way to avoid a conflict between writing/creating the texture and reading/rendering it. In fact, we must use slightly different approaches
when rendering with Direct3D and OpenGL:
1. Direct3D texture objects have an interface method which can do
exactly what we need. Called LockRect( ..), the method collaborates with the Direct3D rendering device interface method
BeginScene() to prevent reading and writing from a texture at
the same time. Thus, in the case of Direct3D, our rendering filter
will write directly into the Direct3D texture object.
2. To use OpenGL (it not being a Microsoft product) requires a little more work. However, the same result can be achieved by setting up an intermediate memory buffer that is protected by the
standard thread programming technique of critical sections [1, 3].
So, in the case of OpenGL, our DirectShow rendering filter will
copy the video sample data into a system RAM buffer (pointed
to by the C++ pointer called Screen3). The window's WM_PAINT
message handler (running in the main application thread) will
read from the RAM buffer, make the texture and render the
mesh.
Any image used as a mesh texture must have dimensions4 that are a
power of two. Of course, a movie frame is very unlikely to have a width
and height with pixel dimensions of 64, 128, 256 or other powers of
two. To get around this problem, we need to either scale the map or
adjust the texture-map coordinates. Scaling an image (i.e., changing its
dimension) can be quite slow unless hardware-accelerated, and so we
will adjust the surface image map coordinates to compensate.
For example, a video frame of resolution 720 × 576 is to be used as
a texture for a rectangular mesh. In both OpenGL and Direct3D,
the texture coordinates (u, v) at the four corners of the mesh should be
(0, 0), (0, 1), (1, 1) and (1, 0), as shown in Figure 15.7. If our hardware
requires that a texture has dimensions which are a power of two, we
must copy the pixel data into a memory block whose size is the next
4
OpenGL has always had this restriction, but it is now being relaxed in version 2.0. Direct3D can support texture dimensions that are not a power of two, but an application program
should not rely on this being available and therefore should still be able to adjust the source
dimensions accordingly.
Figure 15.7. Adjusting texture coordinates for image-map textures with image dimen-
sions that are not powers of two requires the (u, v) texture coordinates to be scaled so
that the part of the texture containing the image covers the whole mesh.
highest power of two exceeding 720 × 576. This is 1024 × 1024, but then the image will only occupy the bottom left-hand corner of the texture rectangle. So for the image to still cover the whole rectangular mesh, we must change the (u, v) coordinates at the four corners to be (0, 0), (0, 576/1024), (720/1024, 576/1024) and (720/1024, 0).
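A small helper of the kind an application might use to do this adjustment (a sketch with names of our own choosing) is given below; it rounds a frame dimension up to the next power of two and returns the corresponding texture-coordinate scale factor.

// Round n up to the next power of two (e.g., 720 -> 1024).
static int NextPowerOfTwo(int n){
    int p = 1;
    while(p < n) p <<= 1;
    return p;
}

// Scale factor to apply to a texture coordinate so that only the part of
// the enlarged texture holding the image covers the mesh.
// e.g., uScale = TexCoordScale(720) = 720/1024, vScale = TexCoordScale(576) = 576/1024
static float TexCoordScale(int frameSize){
    return (float)frameSize / (float)NextPowerOfTwo(frameSize);
}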
15.6.1
To write a custom renderer filter, we will take advantage of the examples that ship with the DirectX SDK. It doesn't have an example that exactly
matches our requirement, but the rendering into a texture example is pretty
close and provides useful hints to get us started. To implement the filter,
we derive a class from the DirectShow API base class CBaseVideoRenderer.
This powerful class offers all the features, such as the interfaces that handle
connecting pins and media type requests. By overriding three of the class's methods, we will get the renderer filter we need for our FilterGraph. Listing 15.14
presents the derived class details.
To be precise, we are not going to write a complete DirectShow filter, because we don't need to! The filter will not be used in other applications, and
we want to keep the code as short and self-contained as possible. So we will
pass over such things as component registration and class factory implementation. By deriving the CRendererD3D class from CBaseVideoRenderer,
all the standard behavior and interfaces which the base class exhibits will be
present in the derived one. Listing 15.15 gives the constructor and destructor.
// Although a GUID for the filter will not be used, we must still declare
// one because the base class requires it.
struct __declspec(uuid("{7B06F833-4F8B-48d8-BECF-1F42E58CEAD5}")) CLSID_RendererD3D;

class CRendererD3D : public CBaseVideoRenderer
{
public:
    CRendererD3D(LPUNKNOWN pUnk, HRESULT *phr);
    ~CRendererD3D();
public:
    // The three methods overridden from the base class (see Listings 15.16 - 15.18):
    HRESULT CheckMediaType(const CMediaType *pmt);  // is this video format acceptable?
    HRESULT SetMediaType(const CMediaType *pmt);    // format chosen - note size, create texture
    HRESULT DoRenderSample(IMediaSample *pSample);  // copy each decoded frame into the texture
    LONG m_lVidWidth;   // Video/movie width
    LONG m_lVidHeight;  // Video/movie height
    LONG m_lVidPitch;   // Pitch is the number of bytes in the video buffer
                        // required to step to next row
};
Listing 15.14. The derived class for the Direct3D texture renderer filter. The member
variables record the dimension of the movie/video frames, and the pitch specifies how
the addresses in RAM change from one row to the next. This allows the rows in an
image to be organized into memory blocks that are not necessarily the same size as
the width of the frame.
// Do nothing
Listing 15.15. The Direct3D video texture renderer class constructor and destructor.
The class constructor has very little work to do; it simply initializes the base class and
one of the member variables.
Listing 15.16. The check media type method returns S_OK when a suitable video
format has been enumerated. The DirectShow mechanism will put in place whatever
(decoder etc.) is needed to deliver media samples in that format.
Listing 15.17. The SetMediaType() method extracts information about the movie's
video stream and creates a texture to hold the decoded frame images.
being delivered by the DirectShow filter chain and copies the image
data into the texture. See Listing 15.18.
Overriding these three methods is all we need to do to complete the renderer filter. If we wanted, we could easily turn it into a fully fledged DirectShow filter, install it as a Windows component and make it amenable to
construction via the CoCreateInstance(...) mechanism.
The code for the OpenGL renderer filter is virtually identical, so we will
not examine its code here. Chapter 16 will look in some detail at a custom
OpenGL DirectShow renderer filter for presenting stereoscopic movies.
Listing 15.18. The DoRenderSample() method is overridden from the base class
to gain access to the video samples. In the case of OpenGL, the video pixel data is
copied to a RAM buffer.
The Application
The program uses the standard Windows template code to establish a
small window and menu bar. In response to user commands, the application
creates windows to render the OpenGL output and builds the geometry. It
does the same thing using Direct3D. Listing 15.19 shows the code to build
and configure the Direct3D output so that it appears in a floating
(child) window. The code in Listing 15.20 achieves the same thing using
OpenGL.
..// global Direct3D object pointers, all initialized to NULL
LRESULT WINAPI D3DMsgProc( HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam ){
  switch( msg ){
    .. // other messages
    case WM_TIMER:          // render
    case WM_PAINT:
      RenderDirect3D();     // Update the main window when needed
      break;
  }
  return DefWindowProc( hWnd, msg, wParam, lParam );
}
INT WinD3D(HINSTANCE hInst){   // Create the window to receive the D3D output
UINT uTimerID=0;
// Register the window class
WNDCLASSEX wc = { sizeof(WNDCLASSEX), CS_CLASSDC, D3DMsgProc, 0L, 0L,
GetModuleHandle(NULL),
LoadIcon(hInst, MAKEINTRESOURCE(IDI_TEXTURES)),
NULL, NULL, NULL, CLASSNAME, NULL };
RegisterClassEx( &wc );
// Create the application's window
g_hWnd = CreateWindow( CLASSNAME, TEXT("Direct3D Window "),
WS_OVERLAPPEDWINDOW, 100, 100, 300, 300,
GetDesktopWindow(), NULL, wc.hInstance, NULL );
if( SUCCEEDED(InitD3D(g_hWnd))){
// Initialize Direct3D
if( SUCCEEDED(InitGeometry())){// Create the scene geometry
// start a timer to render the scene every 15 - 20 ms
uTimerID = (UINT) SetTimer(g_hWnd, TIMER_ID, TIMER_RATE, NULL);
}
}
return 0;
}
Listing 15.19. Initializing the window to receive the Direct3D output requires setting
up the geometry and creating the Direct3D device and other objects. The initialization code is based on the standard templates in Listings 13.16 and 13.17.
Listing 15.20. Initializing the OpenGL output window. Initialization makes use of
code from the basic template in Listings 13.1 and 13.2.
15.6.2
VOID SetupMatrices(){
..// other matrices are the same as in the template code
D3DXMATRIX matWorld;
D3DXMatrixIdentity( &matWorld );
D3DXMatrixRotationY( &matWorld, // rotate slowly about vertical axis
(FLOAT)(timeGetTime()/1000.0 - g_StartTime + D3DX_PI/2.0));
hr = g_pd3dDevice->SetTransform( D3DTS_WORLD, &matWorld );
}
VOID RenderDirect3D(){
HRESULT hr = S_OK;
if( g_bDeviceLost ){ .... }   // rebuild the Direct3D device if necessary
// Clear the backbuffer and the zbuffer
hr = g_pd3dDevice->Clear( 0, NULL, D3DCLEAR_TARGET|D3DCLEAR_ZBUFFER,
D3DCOLOR_XRGB(bkRed,bkGrn,bkBlu), 1.0f, 0 );
// Begin the scene
hr = g_pd3dDevice->BeginScene();
// Setup the world, view, and projection matrices
SetupMatrices();
hr = g_pd3dDevice->SetTexture( 0, g_pTexture );
hr = g_pd3dDevice->SetStreamSource( 0, g_pVB, 0, sizeof(CUSTOMVERTEX) );
hr = g_pd3dDevice->SetVertexShader( NULL );
hr = g_pd3dDevice->SetFVF( D3DFVF_CUSTOMVERTEX );
hr = g_pd3dDevice->DrawPrimitive( D3DPT_TRIANGLESTRIP, 0, 2*nGrid-2 );
hr = g_pd3dDevice->EndScene();
hr = g_pd3dDevice->Present( NULL, NULL, NULL, NULL );
}
Listing 15.21. Render the mesh (with texture, if it exists) using Direct3D.
Listing 15.22. Update the texture coordinates to account for texture sizes that must be of
dimension 2^x (a power of two).
It may be necessary (as we have already discussed) to alter the vertices' texture-mapping coordinates. This is done by the function UpdateMappingD3D() of
Listing 15.22.
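As a rough sketch of the kind of remapping UpdateMappingD3D() performs (this is not the book's actual Listing 15.22; the tu/tv fields of CUSTOMVERTEX, the vertex count 2*nGrid and the scale parameters are assumptions consistent with the surrounding code):

VOID UpdateMappingD3D(float fTU, float fTV)   // fTU = videoWidth/textureWidth etc.
{
    CUSTOMVERTEX *pVertices;
    // lock the vertex buffer so the texture coordinates can be edited in place
    if( FAILED( g_pVB->Lock(0, 0, (void **)&pVertices, 0) ) ) return;
    for(DWORD i = 0; i < 2*nGrid; i++){
        pVertices[i].tu *= fTU;   // stop u at the right-hand edge of the video
        pVertices[i].tv *= fTV;   // stop v at the bottom edge of the video
    }
    g_pVB->Unlock();              // call this once, after the texture is created
}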
15.6.3
To render the textured mesh using OpenGL, the code in Listing 15.23 is
used. Function DrawGeometry() draws a rectangular mesh of quadrilateral polygons with texture coordinates applied. The texture itself is created
from the pixel data stored in the memory buffer pointed to by Screen3 and
placed there by the renderer filter in the movie player's FilterGraph. In the
OpenGL rendering thread, visualization is accomplished by calling functions
glBindTexture() and glTexImage2D(), which copy the image pixels from
host RAM into texture memory in the GPU.
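The copy described here amounts to something like the following sketch. It assumes the renderer filter has filled Screen3 with BUF_SIZEX x BUF_SIZEY 24-bit RGB pixels (names from the text) and that g_texName is a texture object created earlier with glGenTextures() (an assumed name); the function itself is illustrative:

void UploadVideoTexture(void)
{
    glBindTexture(GL_TEXTURE_2D, g_texName);       // select the video texture
    glPixelStorei(GL_UNPACK_ALIGNMENT, 1);         // rows in Screen3 are tightly packed
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB,         // copy host RAM to GPU texture memory
                 BUF_SIZEX, BUF_SIZEY, 0,
                 GL_RGB, GL_UNSIGNED_BYTE, Screen3);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
}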
To construct an OpenGL version of the renderer filter, the following two
methods from the CBaseRenderer base class are overridden:
CRendererOpenGL::SetMediaType(). This method, called once
when the graph is started, uses the size of the video frame to allocate
Listing 15.23. Drawing in OpenGL. The WM_PAINT message handler calls DrawSceneOpenGL() and swaps the front and back buffer, as described in Section 13.1.2.
a RAM buffer for the pixel data (see Listing 15.24). The size of the
buffer is set to the smallest size suitable to accommodate a texture of
dimension 2^x (see Figure 15.7). Global variables maxx and maxy contain the necessary mapping coordinate scaling factors, and BUF_SIZEX
and BUF_SIZEY define the texture size.
CRendererOpenGL::DoRenderSample(). This method copies the
pixel data from the video sample to the memory buffer (Screen3).
Care must be taken to ensure the address offset between subsequent
rows (the pitch) is applied correctly. Since this method is accessing
Screen3, which could potentially be in use by another thread (i.e., the
main application thread executing the DrawSceneOpenGL() function),
a critical section synchronization object (CAutoLock lock(&g_cs);) is
deployed to prevent conflict (see Listing 15.25).
Listing 15.24. The CRendererOpenGL::SetMediaType() method.
Listing 15.25. The OpenGL custom DirectShow rendering filter's method DoRenderSample().
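Since the body of this method is not reproduced above, the following is only a sketch of the copy it performs, not the book's Listing 15.25. It uses the names introduced in the text (Screen3, g_cs, BUF_SIZEX and the m_lVid... members) and assumes the destination rows are BUF_SIZEX*3 bytes apart:

HRESULT CRendererOpenGL::DoRenderSample(IMediaSample *pSample)
{
    BYTE *pSrc;
    pSample->GetPointer(&pSrc);               // address of the decoded frame
    CAutoLock lock(&g_cs);                    // keep the drawing thread out of Screen3
    unsigned char *pDst = Screen3;
    for(long y = 0; y < m_lVidHeight; y++){   // copy row by row, honouring the pitch
        memcpy(pDst, pSrc, m_lVidWidth * 3);  // 3 bytes per 24-bit RGB pixel
        pSrc += m_lVidPitch;                  // step to the next source row
        pDst += BUF_SIZEX * 3;                // step to the next row of the texture buffer
    }
    return S_OK;
}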
15.6.4
In this example, where we will use two FilterGraphs in the one program, each
of them is built up in the same way as the other single FilterGraph examples
we have already written. The only new feature relating to this example is the
way in which our custom render filters are inserted in the chain. Listing 15.26
shows the main steps in FilterGraph construction in the case of the filter that
renders into a Direct3D texture.
The OpenGL filter differs only in two lines: Connect the source to a different AVI file and create an instance of an object of the CRendererOpenGL()
class, i.e.:
hr=g_pGB->AddSourceFilter(MediaFileForOpenGL,L"Source File for OGL",
&pSrcFilter);
pRenderer = new CRendererOpenGL(NULL, &hr);
CComPtr<IGraphBuilder>   g_pGB;          // GraphBuilder
CComPtr<IMediaControl>   g_pMC;          // Media Control
CComPtr<IMediaPosition>  g_pMP;          // Media Position
CComPtr<IMediaEvent>     g_pME;          // Media Event
CComPtr<IBaseFilter>     g_pRenderer;    // our custom renderer
extern TCHAR filenameF3DMovie;           // string with movie name
Listing 15.26. Build and run a FilterGraph to play a movie into a Direct3D texture by
using our custom renderer filter, i.e., an instance of class CRendererD3D().
15.6.5
As we approach the end of this chapter, we hope you can see the versatility
of the DirectShow FilterGraph concept and how its SDK and the useful C++
base classes it provides can be used to build a comprehensive array of multimedia application programs. But before we leave this section, we couldn't
resist including a little program that removes one of the FilterGraphs from
our dual-texture movie player and adapts the other's renderer filter to play the
same movie into the two textures at the same time. Have a look at the accompanying code for full detail. Listing 15.27 shows the small changes that need
to be made to the Direct3D renderer filter so that it takes the image source
from the Screen3 buffer used by the OpenGL rendering filter.
Listing 15.27. Filter modifications to render, simultaneously, the same video samples into both the Direct3D and the OpenGL textures.
15.7
Summary
16
Programming Stereopsis
16.1
This section will concentrate on briefly reviewing the hardware options for
driving the different display devices; for example, a CRT screen or a DLP
projector. However, we found that the CRT devices were the most useful in
terms of delivering stereo images. But of course, we also need to consider
the display adapters, of which there are two main types: stereo-ready and
non-stereo-ready.
16.1.1
Stereo-Ready Adapters
At the time of writing, NVIDIA's FX class GPU-based adapters support all the OpenGL 2 functionality with stereo output, and the lower-end models are relatively inexpensive.
2. http://www.nvidia.com/.
3. http://ati.amd.com/.
4. http://www.3dlabs.com/.
Figure 16.1. Circuit and pin configuration layout for stereo-ready graphics adapters.
A stereo-ready adapter provides a synchronization signal that indicates whether the image currently being displayed is intended for the left eye or the right. There is an agreed standard for this signal and its electrical interface
through the VESA standard connector. The output, shown in Figure 16.1, is
a circular three-pin mini-DIN jack.
Pin 1. +5 V power supply capable of driving a current of at least 300 mA.
Pin 2. Left/right stereo sync signal (5 V TTL/CMOS logic level): high (5 V) when the left-eye image is displayed; low (0 V) when the right-eye image is displayed. When not in stereo mode, the signal must be low.
Pin 3. Signal and power ground.
Surprisingly, even though there are many graphics adapter vendors, the
number of suppliers of the key processor chip is much smaller. For example,
the same NVIDIA FX1100 GPU chip is included in adapters from both HP and
PNY. So, there are not only stereo-ready graphics adapters but also stereo-ready graphics processors. For those adapters that don't use a stereo-ready
GPU, a stereo sync signal must be generated in the display electronics. One
way to do this is to clock a D-type flip-flop on the leading edge of the vertical
sync pulses being sent to the monitor, as illustrated in Figure 16.2. The software driver for such adapters must ensure that the D-input is set to indicate a
left or right field. It can do this by hooking up the D-input to a system port
on the CPU. The driver's vertical sync interrupt service routine (ISR) writes
5. http://www.hp.com/.
6. http://www.pny.com/.
Figure 16.2. Generating stereoscopic sync signals in stereo-ready adapters that do not
have stereo-enabled GPUs can be achieved by adding a flip-flop with one input from
the vertical sync signal and another from the operating system driver using a system
I/O port to indicate left or right field.
to this port to indicate a left or right field. Details of the circuitry can be obtained in the StereoGraphics Hardware Developer's Kit documentation [3].
The OpenGL 3D rendering software library (Section 13.1) fully supports
stereo rendering and can report to an application program whether or not the
adapter has this capability. It is the job of the manufacturer's device driver to
interface with the hardware sync signal generator. To comply with the stereo
certification process, a display adapter must implement quad buffering (see
Section 16.4), allow for dual Z-buffers, fully support OpenGL and provide a
three-pin mini-DIN connector. So if you have a certified stereo-ready adapter,
it will work with all our example stereoscopic programs.
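The capability report mentioned above can be obtained with a query along these lines. This is only a sketch; the call is standard OpenGL and is valid once a rendering context is current:

BOOL StereoSupported(void)
{
    GLboolean bStereo = GL_FALSE;
    glGetBooleanv(GL_STEREO, &bStereo);   // does this context have left/right buffers?
    return (bStereo == GL_TRUE);
}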
16.1.2
Non-Stereo-Ready Adapters
Graphics adapters that are not advertised as stereo-ready can still be used for
stereoscopic work. Computer game players use many stereoscopically enabled
products, but since they are not prepared to pay stereo-ready prices, a different approach is adopted. With non-stereo-enabled adapters, the application
programs themselves must support specific stereo delivery hardware, be that
shutter-eye or anaglyph. In many cases, the software product vendor will
have to negotiate with the display adapter manufacturer to build support into
the driver to allow the non-stereo hardware to work with specific applications. This is less satisfactory for application developers, since any program
will have to be rewritten to support specific stereo technology. Nevertheless,
there are one or two general ideas that can be accommodated. StereoGraphics
Figure 16.3. (a) For non-stereo-ready graphics adapters, the last row in the display
raster can be used to indicate with a brightness level whether it is from the left or
right image. (b) Special inline hardware detects the brightness level and synthesizes
the sync signals.
http://www.reald-corporate.com/scientific.
http://www.edimensional.com/.
16.2
Since stereoscopic pictures are basically a combination of two separate pictures, it is possible to adapt any of the single image formats for stereo use.
There are a number that we have heard of (JPS, GIS, BMS, PNS), but there
may be others, too. Each of these is simply an extension of the well-known
JPEG, GIF, BMP, and PNG standards respectively. Simply by changing the
filename extension and using an implied layout for the images, a display program can separate out the left and right parts and present them to the viewer
in the correct way. However, many image file formats allow applications to
embed extra information in what are termed chunks as part of the standard
file headers, and we can use one of these to store information about the stereo
format. The chunks typically give details of such things as image width and
height. The GIF format has what is known as the GIF application extension
to do the same thing. The GIS stereo version of the GIF-encoded image
can be used in just this way. We shall not explore the GIS specification any
further, because as far as photographic pictures are concerned, the JPEG image format (see Section 5.4.1) reigns supreme. The JPEG file specification
is a member of a class of file formats called Exif, and this allows third-party
information chunks to be inserted with the image data that are known as
application markers (APP0, APP1 etc.).
For stereo work, two pictures positioned side by side or one above the
other appear to the JPEG encoding and decoding algorithms as one single
image, twice as wide or twice as high as the non-stereo image. You've probably spotted a small problem here: how do you know which of the two alternatives is being used? In fact, there are other possibilities too (interlaced, for
example). These alternatives are presented graphically in Figure 16.4.
Figure 16.4. A stereoscopic image file format can combine the left and right images by placing them side by side, one above the other or interlaced.
Figure 16.5. The layout of the APP3 marker segment in a JPS file, including the stereoscopic descriptor.
Since the JPEG file format allows any number of proprietary information chunks to
be embedded (as we have seen in Section 5.4.1 where the APP3 marker was
defined), it is possible to insert details of the way an image is to be handled
Figure 16.6. The 32-bit JPS stereoscopic descriptor is divided up into four 8-bit fields.
The separation field in bits 24–31 gives the separation in pixels between the left
and right images. There is little reason to make this non-zero. The misc flags and
layout fields are the most informative. The type field contains 0x01 to indicate a
stereoscopic image.
when interpreted as a stereo pair. This is exactly what happens in the JPS [4]
stereo picture format, where an APP3 application tag is embedded in a JPEG
file. In this case, the body of the APP3 is structured as shown in Figure 16.5.
(It follows the two-byte APP3 chunk identifier 0xFFE3 and the 16-bit integer that gives the length of the remaining chunk, minus the two-byte ID and
length fields of four bytes.) The eight-byte identifier should always contain
the character pattern JPSJPS . The 16-bit length field is the length of that
block. The stereoscopic descriptor is the most important part of the APP3;
it is detailed in Figure 16.6. The remaining part of the APP3 can contain
anything, such as the stereo camera type or eye separation. Specific details on
the stereo format are given in Figure 16.6.
One final comment on the JPS format: if the stereoscopic APP3 marker
is not present in the file then display programs are expected to assume that
the image is in what is called cross-eyed format, where the left and right images
are in side-by-side layout with the right image on the left! The name may
give you a clue as to why it is done in this way. Almost unbelievably, there is a
way to view such images without any special hardware or glasses, but it takes
practice.
To view cross-eyed stereo images:
Display the image in a full-screen mode (right eye image on the
left half of the monitor). Sit about 60 cm away from the screen.
Close your right eye, put your first finger 15 cm in front of your
nose and line it up so that it appears below some feature in the
image on the right of the screen.
Now, close your left eye and open your right eye; your finger
should line up with the same feature in the left side of the screen.
If it doesn't then iterate. Swap back and forth, opening left and
right eyes and moving your finger about until the alignment is
almost correct.
The last step is to open both eyes. Start by focusing on your
finger. Now take it out of the field of view and defocus your eyes
so that the image in the background drifts into focus. This step
is quite hard, because you must keep looking at the point where
your finger was while you defocus.
It works, and it isn't as hard to do as viewing those Magic Eye
stereograms!
16.3
We are not aware of any formats commonly used for recording stereo movies.
One approach we've tried successfully uses two separate files for the left and
right frame sequences. It works well for the lossless compression FLC [2]
format, and there is no reason why the same strategy should not be used for
AVI, MPEG etc. But why go to all the trouble of synchronizing two separate
input streams or having to specify two source files when we can use the same
strategy as in the JPS format for photographs? Since there is no recognized
stereo movie MPEG format, we will just define our own simple one.
Use standard AVI (all codecs are applicable) and assume that
each individual frame is recorded in a left-over-right layout.
We choose, and urge you, to adopt left-over-right as opposed to left-beside-right because it will make the coding of any program for playing back the
movies much simpler and more efficient. There is a good reason for this,
which becomes much more significant when writing a program to play stereo
movies as opposed to one displaying images. It has to do with the row organization of the memory buffer used to store the decompressed movie frame
before display. In ordering the images in a left-above-right order, all the pixels
in the left image are stored in consecutive memory addresses. Consequently,
the left and right images can be copied to video RAM in one block transfer. If the images were stored left-beside-right, it would be necessary to copy
one row at a time, so that the first half of the row is copied to the left frame
buffer etc. With left-over-right storage, it is easy to manufacture C pointers
to the memory occupied by left and right images. The two pointers behave
as if they pointed to independent buffers, even though they are pointing to
different locations in the same buffer.
With this in mind, the program described in Section 16.5 will assume the
height of a movie frame as defined in the AVI file is twice the actual height
and the left image is in the top half of the frame.
16.4
Displaying images and pictures is at the core of graphical user interfaces. Windows' native drawing API (the GDI, or graphics device interface) offers a
number of functions for drawing pictures, but none of them are designed to deliver stereoscopic output.
PIXELFORMATDESCRIPTOR pfd = {
..
};
if (Stereo){ // we want to render in stereo
pfd.dwFlags |= PFD_STEREO;
}
Naturally, a robustly written program should check that the request for a stereoscopic pixel format has been accepted by the driver. If not, it should fall
back into the default mono mode. There is a wide variety of freely available
libraries and code for reading images stored with various file formats. So here
we will assume that somewhere in our application there are a collection of
routines that extract width, height, stereoscopic format data and image pixel
data into global program variables and RAM buffers. (Suitable code
is included along with our own code.) The image pixel data in the RAM
buffer will be assumed to be stored in 24-bit (RGB) format and ordered in
rows beginning at the top left corner of the image and running left-to-right
and top-to-bottom. A global pointer (unsigned char *Screen;) will identify the location of the pixel data buffer. Two long integers (long X,Y;) store
the width and height of the image. In the case of stereoscopic pictures, our
program will make sure that the pixel ordering in the buffer is adjusted so that
the correct images are presented to left and right eyes.
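The fall-back check mentioned at the start of this discussion can be made at pixel-format selection time. The sketch below uses the standard Win32 calls and assumes hDC, pfd and the Stereo flag from the earlier fragment; error handling is omitted:

int iPF = ChoosePixelFormat(hDC, &pfd);          // ask for the format we described
SetPixelFormat(hDC, iPF, &pfd);
PIXELFORMATDESCRIPTOR pfdGot;
DescribePixelFormat(hDC, iPF, sizeof(PIXELFORMATDESCRIPTOR), &pfdGot);
if( !(pfdGot.dwFlags & PFD_STEREO) )
    Stereo = FALSE;                              // no quad-buffered stereo: run in mono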
Section 16.2 provided details of the stereoscopic format for image data
organized in JPS files. Since the image part of the JPS format is encoded
using the JPEG algorithm, we can use any JPEG decoder to obtain the pixel
data and fill the RAM buffer. To acquire the stereoscopic format data, the
program must seek out an APP3 application marker and use its information
to determine whether the left and right images are stored side by side or one
above the other. So that our application is not restricted to displaying JPS
files, we make the assumption that for all non-stereo file formats, the image is
stored in left-above-right order.
When reading a file containing implied stereo images in a left-over-right
format, both images may be loaded to RAM in a single step, say to a buffer
pointed at by: char *Screen;. Given that the width is int X; and the
height is int Y;, we can immediately make it look as if we have independent
image buffers by setting Y = Y/2;, ScreenL = Screen; and ScreenR =
(Screen+3*X*Y);. With no further work required, the program can proceed
to render the left and right images using these pointers and dimensions. For
cases involving other stereo image layouts, the program will have to shuffle
the data into the left-over-right organization. We will not show this code
explicitly but instead carry on to use OpenGL to display the stereo image
pairs.
There are two alternative ways in which one can render a pixel array (an
image) using OpenGL:
1. Use the pixel-drawing functions. This is the fastest way to display an
image. The 3D pipeline is bypassed and the pixels are rendered directly
into the frame buffer, being processed only for scaling and position.
It is possible to use logical operations on a per-pixel basis to achieve
blending or masking effects, too. The OpenGL pixel drawing functions expect the pixels to be ordered by rows starting at the bottom of
the image. This will mean that when we use this method to render
pixels from a RAM buffer, it will be necessary to turn the image upside
down before passing it to OpenGL.
Listing 16.1. Initialize the viewing window by setting the screen origin for the image
pixel bitmap (bottom left corner) and any pixel scaling that we wish to employ. Note
that OpenGL is still in essence a 3D rendering system, so we need to set up a view,
in this case an orthographic view. Otherwise, the little rectangle of pixels would be
distorted and not fill the window.
2. Use the image pixel data to make a texture (image) map, apply the
map to a few primitive polygons and render an orthographic view that
fills the viewport. This second approach is not as crazy as it sounds, because it can lead to some really interesting effects. For example, you can
render a video source into a texture and apply that texture to dynamically deforming objects to achieve some very interesting video mixing
and blending effects. We will use this in some of the projects in Chapter 18.
The first approach will be used in this example. It requires a small amount
of additional initialization code to define the viewport and set the pixel scaling. Listing 16.1 provides the detail.
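As a guide to the sort of thing Listing 16.1 does (a sketch, not the listing itself; nWndX and nWndY stand for the window's dimensions and X and Y for the image's):

void InitPixelDrawing(int nWndX, int nWndY)
{
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrtho(0.0, 1.0, 0.0, 1.0, -1.0, 1.0);      // simple orthographic view
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
    glRasterPos2f(0.0f, 0.0f);                   // pixel origin: bottom-left corner
    glPixelZoom((GLfloat)nWndX/(GLfloat)X,       // scale the image to fill the window
                (GLfloat)nWndY/(GLfloat)Y);
}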
Listing 16.2. Rendering stereoscopic images is accomplished by writing the left and
right image pixels into the left and right back buffers.
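In outline, the stereoscopic drawing of Listing 16.2 comes down to selecting each back buffer in turn and writing the appropriate pixels into it. The sketch below assumes the globals X, Y, ScreenL and ScreenR described earlier and leaves the SwapBuffers() call to the window's paint handler:

void DrawStereoPair(void)
{
    glDrawBuffer(GL_BACK_LEFT);                              // left-eye back buffer
    glRasterPos2f(0.0f, 0.0f);
    glDrawPixels(X, Y, GL_RGB, GL_UNSIGNED_BYTE, ScreenL);   // rows run bottom-to-top
    glDrawBuffer(GL_BACK_RIGHT);                             // right-eye back buffer
    glRasterPos2f(0.0f, 0.0f);
    glDrawPixels(X, Y, GL_RGB, GL_UNSIGNED_BYTE, ScreenR);
}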
case WM_SIZE:
// respond to window size changes
// window only changes size in NON full screen mode
if(!FullScreen){
glViewport(0, 0, (GLsizei)LOWORD(lParam), (GLsizei)HIWORD(lParam));
// change pixel size and aspect ratio to fill window
if(X > 0 && Y > 0)
glPixelZoom((GLfloat)LOWORD(lParam)/(GLfloat)X,
            (GLfloat)HIWORD(lParam)/(GLfloat)Y);
else
// in case X or Y are zero
glPixelZoom((GLfloat)1.0,(GLfloat)1.0);
InvalidateRect(hWnd,NULL,FALSE); // redraw the OpenGL view
}
break;
Listing 16.3. Handling the WM_SIZE message is only required when rendering into a
resizable window.
16.5
The movie player that we are going to put together in this section will build
on a number of other examples we have already discussed. Before getting into
details, it is useful to consider what alternatives we have at our disposal:
Rendering. There is no practical alternative for flexibly rendering stereo
movies other than to use OpenGL in the same way that we have just
done for stereoscopic images. We can use the same software design and
build the rest of the application so that it interfaces to the stereoscopic
rendering routines via a pointer to a RAM buffer storing the pixel data
from the video frame and global variables for width and height.
Decompression. To write a file parser and decompression routine to
support all the codecs one might encounter is impractical. As we discussed in Section 5.5.2, the philosophy of providing extensible components allows any player program to call on any installed codec, or
even to search the Internet and download a missing component. Under Windows, the simplest way to achieve this is to use the DirectShow API.
Audio. A movie player that only produces pictures just isn't good
enough. Nor is one that doesn't manage to synchronize the video and
audio tracks. Again, we have seen in Chapter 15 that DirectShow delivers everything we need in terms of audio processing.
Timing. Synchronization between sound and vision is not the only
timing issue. The movie must play at the correct speed, and if it looks
like the rendering part of a program can't keep up then frames must
be skipped, not slowed down. Again, DirectShow is designed to deliver this behavior, and if we were to write a custom rendering filter
using OpenGL with stereoscopic output, we would gain the benefit of
DirectShow's timing.
Stereo file format. We will assume that each frame in the AVI file is in
the format discussed in Section 16.3. This makes sense because the
left-over-right organization allows for the most efficient use of memory
copy operations, thus maximizing the speed of display.
Having examined the alternatives, the conclusion is obvious:
Use DirectShow to decode the video data stream and play the
audio. Use OpenGL to render the stereoscopic video frames.
In Section 15.3, we coded a simple DirectShow movie player application
that used the built-in video rendering filter at the end of the chain. For
the stereo movie player, we will use as much of that program as possible
but replace the built-in DirectShow renderer filter with a custom one of
our own that formats the video information as 24-bit RGB pixel data in
bottom-to-top row order and writes it into a frame buffer in system RAM. A
Figure 16.7. The main features and data flow of the stereo-movie player application.
small modification to the stereo image display program will be all that is required to provide the visualization of the video using OpenGL pixel drawing
functions. Figure 16.7 illustrates the block structure and data flow for the
program.
This program uses the stereo-image display program's code as its starting point. The application contains a message processing loop and it has a
small main window (as before). It also creates a child window to display the
contents of the RAM frame buffer using OpenGL. A timer is used to post
WM_PAINT messages 50 times per second to the display window and hence
render the video frames in the RAM buffer. At this point in the code, if the
FilterGraph has not been built or is not running then no movie is playing: the
RAM buffer will be empty and the window (or screen) will just appear blank.
Figure 16.8. The FilterGraph for the AVI stereo movie player. Only the custom
rendering filter at the end of the chain is unique to this player.
// Although a GUID for the filter will not be used we must still declare
// one because the base class requires it.
struct
__declspec(uuid("{C7DF1EAC-A2AB-4c53-8FA3-BED09F89C012}")) CLSID_OpenGLRenderer;
class CRendererOpenGL : public CBaseVideoRenderer
{
public:
  // Class constructor and destructor
  CRendererOpenGL(CMoviePlayer *pPres,   // pointer to application object
                  LPUNKNOWN pUnk, HRESULT *phr);
  ~CRendererOpenGL();
public:
Listing 16.4. The derived class for the renderer filter. The member variables record
the dimension of the movie/video frames and the pitch specifies how the addresses in
RAM change from one row of pixels to the next. This allows the rows in an image to
be organized into memory blocks that are not necessarily stored contiguously.
16.5.1
If you aren't familiar with DirectShow and its filter concept, now would be
a good time to read Chapter 15 and in particular Section 15.6, where we
describe a monoscopic version of this filter.
CRendererOpenGL::CRendererOpenGL(                     // constructor
  CMoviePlayer *pPres,                                // pointer to application object
  LPUNKNOWN pUnk,
  HRESULT *phr )
: CBaseVideoRenderer(__uuidof(CLSID_OpenGLRenderer),  // configure base class
                     NAME("OpenGL Renderer"), pUnk, phr)
, m_pCP( pPres){                                      // initialize member variable
  *phr = S_OK;                                        // simply notify - everything OK
}
CRendererOpenGL::~CRendererOpenGL(){ }                // do nothing
// do nothing
Listing 16.5. The video renderer class constructor and destructor. The class construc-
tor has very little work to do. It simply initializes the base class and one of the
member variables.
Listing 16.6. Check media type: return S_OK when a suitable video format has been
enumerated.
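A check of this kind typically looks like the sketch below; it is not necessarily the book's exact Listing 16.6, but it accepts only uncompressed 24-bit RGB video described by a VIDEOINFOHEADER, which is what the rest of the code expects:

HRESULT CRendererOpenGL::CheckMediaType(const CMediaType *pmt)
{
    if( *pmt->Type()       != MEDIATYPE_Video    ) return E_FAIL;  // video streams only
    if( *pmt->FormatType() != FORMAT_VideoInfo   ) return E_FAIL;  // VIDEOINFOHEADER format
    if( *pmt->Subtype()    != MEDIASUBTYPE_RGB24 ) return E_FAIL;  // uncompressed 24-bit RGB
    return S_OK;   // DirectShow will insert whatever decoders are needed upstream
}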
Listing 16.7. Set media type: extract information about the movie's video stream and
allocate RAM buffers ready to hold the decoded frame images.
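In outline, the method does something like the sketch below. This is only a sketch: the globals X, Y, ScreenL and ScreenR follow the text, while the use of malloc() and the exact buffer-management details are assumptions:

HRESULT CRendererOpenGL::SetMediaType(const CMediaType *pmt)
{
    VIDEOINFOHEADER *pvih = (VIDEOINFOHEADER *)pmt->Format();
    m_lVidWidth  = pvih->bmiHeader.biWidth;
    m_lVidHeight = abs(pvih->bmiHeader.biHeight);
    m_lVidPitch  = (m_lVidWidth * 3 + 3) & ~3;     // 24-bit DIB rows are DWORD-aligned
    // each frame holds the left image above the right image, so split the
    // allocation into two halves addressed through ScreenL and ScreenR
    long lHalf = m_lVidHeight / 2;
    ScreenL = (unsigned char *)malloc(m_lVidWidth * m_lVidHeight * 3);
    ScreenR = ScreenL + m_lVidWidth * lHalf * 3;
    X = m_lVidWidth;  Y = lHalf;                   // globals used by the drawing code
    return S_OK;
}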
Listing 16.8. The DoRenderSample() method is overridden from the base class to gain access to the decoded video samples and copy their pixel data into the ScreenL and ScreenR buffers.
// Global variables: left and right frame buffers, movie size, window handles.
Listing 16.9. An overview of the key execution steps in the stereo movie player application. This is a skeletal view of the key points only, so that you can follow the
execution flow in the full code.
16.5.2
We turn now to the remainder of the application. There are a number of ways
in which we could implement it. We choose to follow the framework outlined
in Listing 16.9 because it is minimal and should allow easy customization and
adaptation. It follows the general design philosophy for all the programs in this
book:
The application opens in WinMain(). A window callback function, MainWindowMsgProc(), is identified through a Windows
class. The program then goes into a message loop.
Two of the main window's menu commands are of note.
The DisplayMovie() function brings about the creation of a
child window to contain the rendered stereo movie. This allows
either windowed or full-screen display. The child window's message handling function, OpenGLWndMSgProc(), processes the
WM_PAINT message and renders the output of the pixel memory
buffer ScreenL/R.
A timer posts a WM_TIMER message to OpenGLWndMSgProc() every few milliseconds to re-render the frame buffer using OpenGL.
If no movie is playing into this buffer (i.e., ScreenL=NULL;), the
window simply remains blank.
The function RunMovie() is responsible for building the DirectShow FilterGraph and running it. The graph will use our
custom renderer filter to play a movie in a loop into the screen
buffer, identified by pointer ScreenL.
Now let's focus on the movie player code in the function RunMovie().
16.5.3
The CMoviePlayer class (Listing 16.11) allows us to conveniently represent all the elements which go to make up a DirectShow FilterGraph as a single
object.
Special note should be taken of the use of the Active Template Library's
(ATL) COM interface pointer CComPtr<>, for example in the statements:
CComPtr<ICaptureGraphBuilder2> m_pCG;
CComPtr<IGraphBuilder>         m_pGB;
It takes this form because we chose to stop short of making a filter that
can be registered with Windows as a system component, but in every other
respect it acts like any other DirectShow filter.
HRESULT CMoviePlayer::BuildMoviePlayerFilterGraph(){
// ALL hr return codes should be checked with if(FAILED(hr)){Take error action}
HRESULT hr = S_OK;
CComPtr<IBaseFilter> pRenderer; // custom renderer filter
CComPtr<IBaseFilter> pSrcFilter; // movie source reader
CComPtr<IBaseFilter> pInfTee;
// filter to split data into two
CComPtr<IBaseFilter> pAudioRender;
CRendererOpenGL *pCTR=0;
// Custom renderer
hr = m_pGB.CoCreateInstance(CLSID_FilterGraph, NULL, CLSCTX_INPROC);
// Get the graphs media control and media event interfaces
m_pGB.QueryInterface(&m_pMC); // media Control Interface
m_pGB.QueryInterface(&m_pME); // media Event Interface
m_pGB.QueryInterface(&m_pMP); // media Position Interface
// Create the OpenGL Renderer Filter !
pRenderer = (IBaseFilter *)new CRendererOpenGL(this, NULL, &hr);
// Get a pointer to the IBaseFilter interface on the custom renderer
// and add it to the existing graph
hr = m_pGB->AddFilter(pRenderer, L"Texture Renderer");
hr = CoCreateInstance (CLSID_CaptureGraphBuilder2 , NULL, CLSCTX_INPROC,
IID_ICaptureGraphBuilder2, (void **) &(m_pCG.p));
// Attach the existing filter graph to the capture graph
hr = m_pCG->SetFiltergraph(m_pGB);
USES_CONVERSION;
// Character conversion to UNICODE
hr=m_pGB->AddSourceFilter(A2W(MediaFile),L"Source File",&pSrcFilter);
// split the data stream into video and audio streams
hr = CoCreateInstance (CLSID_InfTee, NULL, CLSCTX_INPROC,
IID_IBaseFilter, (void **) &pInfTee);
hr=m_pGB->AddFilter(pInfTee,L"Inf TEE");
hr = m_pCG->RenderStream(0,0,pSrcFilter,0,pInfTee);
IPin *pPin2;
// get output pin on the splitter
hr = m_pCG->FindPin(pInfTee,PINDIR_OUTPUT,0,0,TRUE,0,&pPin2);
///////////////// audio section
hr = CoCreateInstance (CLSID_AudioRender, NULL, CLSCTX_INPROC,
IID_IBaseFilter, (void **) &pAudioRender);
hr=m_pGB->AddFilter(pAudioRender,L"AudioRender");
hr = m_pCG->RenderStream (0, &MEDIATYPE_Audio,
pSrcFilter, NULL, pAudioRender);
///////////////////// end of audio
hr = m_pCG->RenderStream (0, &MEDIATYPE_Video, // Connect all the filters
pPin2, NULL, pRenderer);
// together.
hr = m_pMC->Run();
// Start the graph running - i.e. play the movie
return hr;
}
Listing 16.12. Build the FilterGraph using the DirectShow GraphBuilder object and
its associated CaptureGraphBuilder2 COM object. Note that every hr return code
should be checked to make sure that it has returned an S_OK code before proceeding. If an error code is returned, appropriate error action should be taken.
16.5.4
Control
The final listing of this section (Listing 16.13) presents the remaining methods of the CMoviePlayer class which are concerned with controlling the
presentation of the movie: playing it, restarting it or positioning the playback point. Most of these tasks are handled through the IMediaControl,
IMediaPosition and IMediaSeeking interfaces, which offer a large selection of helpful methods, only a few of which are explicitly listed in our code.
Full information is available in the DirectX SDK (DXSDK).
void CMoviePlayer::Cleanup(void){
if( m_pMC )m_pMC->Stop();
}
// restart
Listing 16.13. The other methods in the CMoviePlayer class. Constructor, destruc-
tor and method to use the Media Control interface for player control functions such
as stop, play etc.
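For illustration, control methods of this sort need only a line or two each; the sketch below (the method names are ours, not necessarily those in the full code) shows how the IMediaControl and IMediaPosition interfaces are used:

void CMoviePlayer::Play(void)    { if( m_pMC ) m_pMC->Run();   }
void CMoviePlayer::Pause(void)   { if( m_pMC ) m_pMC->Pause(); }
void CMoviePlayer::Restart(void) {                 // rewind to the start and keep playing
    if( m_pMP ) m_pMP->put_CurrentPosition(0.0);   // IMediaPosition works in seconds
}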
16.6
So far, we haven't said much about how to acquire stereoscopic images. Chapter 10 discussed some special cameras and quite expensive HMDs with two
built-in video cameras. In this section, we will develop a little program to let
you use two inexpensive webcams for capturing stereoscopic images.
The application is based, to a considerable extent, on a combination of
the code in Section 15.4.1, where we used a single camera to grab images, and
some elements from the previous section on playing stereo movies. For this
application, the implementation strategy is:
Build two FilterGraphs to capture live video from different sources
and render them into two memory frame buffers, for the left and
right eyes respectively. Use an OpenGL stereo-enabled window
for preview if desired (or two windows for individual left/right
video). On command from the user, write a bitmap (or bitmaps)
from the frame buffers to disk.
Figure 16.9. The block structure and data-flow paths for the stereo image capture
application.
The concept is summarized in Figure 16.9, which illustrates how the previewed output from the stereo renderer is delivered to two OpenGL windows,
one previewing the left camera's output and the other the right camera's output. The full program code offers the option to use stereo-enabled hardware.
For now, though, we will examine the program that presents its output in two
windows.
Note: If one uses two OpenGL windows in a single Windows
application program, they should normally execute in separate
threads. (We discussed threading in Section 15.1.1.) Consequently, we will write our application as a multi-threaded one.
struct
__declspec(uuid("{36F38352-809A-4236-A163-D5191543A3F6}")) CLSID_OpenGLRenderer;
class CRendererOpenGL : public CBaseVideoRenderer
{
public:
// Class constructor and destructor
CRendererOpenGL(CMoviePlayer *pPres,
// NEW VARIABLE FOR THREAD/CAMERA Identity
long id,
LPUNKNOWN pUnk,HRESULT *phr);
...
// same methods
private: // member variables
// identify whether rendering left or right camera
// copy from "id"
long m_instance;
CGrabCamera * m_pCP;// replace pointer to CMoviePlayer
....
};
CRendererOpenGL::CRendererOpenGL( // constructor
CGrabCamera *pPres,
// POINTER TO CAMERA OBJECT
long id,   // IDENTIFY LEFT OR RIGHT CAMERA
LPUNKNOWN pUnk,
HRESULT *phr )
:CBaseVideoRenderer(__uuidof(CLSID_OpenGLRenderer),
NAME("OpenGL Renderer"),pUnk,phr)
, m_pCP( pPres)
, m_instance(id){
// NEW !!
... // constructor body - same as movie player
}
Listing 16.14. Modification to the renderer filter for use in the stereo image grab pro-
gram. This listing only highlights the changes for the class and its implementation.
The omitted code is the same as in Listings 16.4 to 16.8.
It makes sense to place the whole of the left and right capture
tasks in different threads. The threads can write into different
parts of the same frame buffer or into separate left and right
frame buffers. It makes little difference.
Before looking at the code structure, we will quickly consider a minor
change to the OpenGL renderer filter class. The changes are highlighted in
Listing 16.14. In it, we see that the application class pointer is changed to
reflect the different application name, CGrabCamera *m_pCP;. An extra
member variable long m_instance is used to identify whether this instance
of the filter is part of the graph capturing the left image or the graph capturing
the right image. When it comes to copying the sample into ScreenL or
ScreenR or allocating the screen pointers, the identifier m_instance shows
the way.
To form the program's overall structure, we will make it as much like
the movie player as possible. All of the work required to handle a single
camera source (its DirectShow FilterGraph including the OpenGL rendering
filter etc.) will be encapsulated in a C++ class, and an object of this class
will be created to handle it. Listing 16.15 gives the CGrabCamera class and
Listing 16.16 brings all the programs features together to show the flow of
execution. Some remarks about a few of the statements in Listing 16.16 are
appropriate:
class CGrabCamera {
public:
  CGrabCamera( );
  ~CGrabCamera();
  // render into the OpenGL window
  HRESULT Render(HWND, long, long);
  // choose the image format we wish to obtain from the camera
  HRESULT SetMediaFormat( GUID subtype);
  HWND m_hwnd;        // window handle for window rendering camera preview
  long instance;      // left or right instance
  // build the image capture graph
  HRESULT BuildCaptureGraph(long);
  void Cleanup(void);
private:
  // methods
  HRESULT CaptureVideo(IBaseFilter *pRenderer, long id);
  HRESULT FindCaptureDevice(IBaseFilter **ppSrcFilter, long *nfilters, long id);
  void CheckMovieStatus(void);
  CComPtr<ICaptureGraphBuilder2> m_pCG;   // capture graph builder
  CComPtr<IGraphBuilder>         m_pGB;   // filter graph builder
  CComPtr<IMediaControl>         m_pMC;   // media control interface
  CComPtr<IMediaEvent>           m_pME;   // media event interface
};
Listing 16.15. Class specification for the object which will acquire video frames from one of the camera sources.
Thread creation. The _beginthread(ThreadFunctionName, 0, parameter) function is the preferred one in Visual C++. Its arguments carry the name of a function to be the entry point for the thread
and a parameter that will be passed on as the argument to the thread
function. The Win32 API has its own thread-creation function that
mirrors _beginthread very closely. (A sketch of how the two capture threads might be started appears after this list.)
Thread identifiers. Each thread essentially uses the same code and the
same message handler. So that the program can identify which thread
is which (first or second), each thread is assigned an identity (ID) during its creation. The ID is passed to the thread function as part of
the THREAD_DATA structure. Additionally, the preview window handler
must know whether it is rendering left or right images, and so this
structure is also passed to it using the small user data long word which
every window makes available. This is written by:
SetWindowLongPtr(hwnd, GWLP_USERDATA, (LONG_PTR)pp);
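To make the flow concrete, here is a sketch of how the two capture threads might be started and how a preview window can recover its identity. THREAD_DATA's layout, the function names and the id values 1 and 2 are illustrative assumptions consistent with the description above:

#include <process.h>

typedef struct { long id; HWND hwnd; } THREAD_DATA;   // assumed layout

static THREAD_DATA tdLeft  = { 1, NULL };
static THREAD_DATA tdRight = { 2, NULL };

static void CameraThreadFunction(void *p)    // entry point for each capture thread
{
    THREAD_DATA *ptd = (THREAD_DATA *)p;
    // create the preview window for camera ptd->id, store ptd with
    // SetWindowLongPtr(hwnd, GWLP_USERDATA, (LONG_PTR)ptd), build the capture
    // FilterGraph and run this thread's message loop
}

static void StartCaptureThreads(void)
{
    _beginthread(CameraThreadFunction, 0, (void *)&tdLeft);    // left camera
    _beginthread(CameraThreadFunction, 0, (void *)&tdRight);   // right camera
}

static long PreviewWindowId(HWND hwnd)       // called from the preview window handler
{
    THREAD_DATA *pp = (THREAD_DATA *)GetWindowLongPtr(hwnd, GWLP_USERDATA);
    return (pp != NULL) ? pp->id : 0;        // 1 = left camera, 2 = right camera
}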
16.6.1
With one small exception, all the components of the FilterGraph that are
needed to capture images from a live video source are present in the movie
player example. The code in Listing 16.17 will build the FilterGraph. The
most significant difference between this and earlier examples is the call to
FindCaptureDevice(). The code of this function is designed to identify
appropriate video sources. If id=1, it will return a pointer to the first suitable
source, used for the left images. When id=2, the function returns a pointer
Listing 16.17. Constructing a FilterGraph to capture images from a video source using
a custom renderer filter. Note the slightly different way that the ATL CComPtr
COM pointer is used when creating the Graph and GraphBuilder objects.
to the second suitable source. Naturally, all function return values (the hr=
codes) need to be checked to see that everything worked as expected.
The last comment to make on this example is simply to say that a command given to the main window, via its menu for example, can initiate a process to write to disk the images in the ScreenL and ScreenR frame buffers.
This could be in the form of Windows bitmaps or by using a compressed or
lossy format, such as JPEG.
16.7
Figure 16.10. The FilterGraph for the AVI stereo movie compilation program.
The main new ideas concern the custom source filter and how to insert the image files into the video stream. Once we
have written the custom source filter to read the stereo image pairs, it fits into
the FilterGraph concept that should be becoming familiar to you. Indeed, the
graph for this application is not going to be significantly different from those
we have already seen. Its details are set out in Figure 16.10.
In this example, we will start with a look at the overall application structure in Listing 16.18. The strategy is to make up a list of the file names for
the bitmap image pairs. This we do by providing the first part (the root) of
the name for the files. Then we add a sequence number at the end and put
back the file name extension. Most 3D movie software uses this idea to write
image sequences. For example, write each movie frame to an individual file:
fileX001.bmp, fileX002.bmp etc. for frames 1, 2 etc.
In our code, we detect these sequences in function buildBMPfilelist()
and store their names in the *FileList1[] and *FileList2[] character arrays. The rest of the code is mostly similar to what we have seen before.
The new element concerns the source filter shown diagrammatically in Figure 16.10.
// Window handler
LRESULT CALLBACK WndMainProc(HWND hwnd,UINT msg,WPARAM wParam,LPARAM lParam){
switch (msg){
case WM_GRAPHNOTIFY: HandleGraphEvent(); break;
// Stop capturing and release interfaces
case WM_CLOSE:
CloseInterfaces(); break;
case WM_DESTROY:
PostQuitMessage(0); return 0;
}
return DefWindowProc (hwnd, msg, wParam, lParam);
}
// main entry point
int PASCAL WinMain(HINSTANCE hInstance,HINSTANCE hIP, LPSTR lpCL, int nS){
if(FAILED(CoInitializeEx(NULL, COINIT_APARTMENTTHREADED))) exit(1);
RegisterClass( ..
// usual Windows stuff
ghApp = CreateWindow(...
// usual Windows stuff
buildBMPfilelist(hInstance,ghApp);
// build the list of files to be used
BuildFilterGraph(hInstance);
// build and run the assembly filter graph
while(1){ ... }
// message loop
CoUninitialize();
return 0;
}
// When the graph has finished, i.e. there are no more input files, an event
// will be generated and the application will pause. This function monitors
// the event and sends the user a message that the movie is built.
HRESULT HandleGraphEvent(void){
LONG evCode, evParam1, evParam2;
HRESULT hr=S_OK;
if (!g_pME)return E_POINTER;
while(SUCCEEDED(g_pME->GetEvent(&evCode,(LONG_PTR *)&evParam1,
(LONG_PTR *)&evParam2,0))){
switch (evCode){
case EC_COMPLETE: MessageBox(NULL,"Complete","Output",MB_OK); break;
}
hr = g_pME->FreeEventParams(evCode, evParam1, evParam2);
}
return hr;
}
static void buildBMPfilelist(HINSTANCE hInstance,HWND hWnd){
.. // Specify filename roots e.g. "SeqRight" "SeqLeft" then the functions
.. // GetFilesInSequence will build up the list with SeqRight001.bmp
.. // SeqRight002.bmp etc. etc. and SeqLeft001.bmp SeqLeft002.bmp etc. etc.
nBitmapFiles1=GetFilesInSequence(NameRoot1,FileList1);
nBitmapFiles2=GetFilesInSequence(NameRoot2,FileList2);
}
Listing 16.19. Building and executing the FilterGraph. Once finished, an event is sent
to the application window's message handler, which it uses to pop up an alert message
that the job is done and the program may be terminated.
16.7.1
A source filter is a little different from a renderer filter. The source filter
pushes the video samples through the FilterGraph; it is in control. Following
our discussion in Section 15.1 on DirectShow filters, the idea of push and
pull in a FilterGraph and how filters connect through pins should not be a
surprise. A source filter is written by deriving a class for the filter from a
suitable base class and overriding one or more of its methods to obtain the
specific behavior we want.
In the case of a source filter, we must derive from two classes: one from
CSource for source filters and one from the source pin class CSourceStream.
In our program, we call them CBitmapSourceFilter and CPushPinOnBSF,
respectively. There is very little to do in the CSourceStream class, just the
Listing 16.20. Filter classes: the source class and the filter pin class.
struct
__declspec(uuid("{4B429C51-76A7-442d-9D96-44C859639696}"))
CLSID_BitmapSourceFilter;
// Filter's class constructor and destructor
CBitmapSourceFilter::CBitmapSourceFilter(int Nframes, IUnknown *pUnk,
HRESULT *phr)
: CSource(NAME("BitmapSourceFilter"), pUnk,
__uuidof(CLSID_BitmapSourceFilter)){
// create the source filters output pin
m_pPin = new CPushPinOnBSF(Nframes, phr, this);
}
// delete the output pin
CBitmapSourceFilter::~CBitmapSourceFilter(){ delete m_pPin; }
// Filter's pin class constructor
CPushPinOnBSF::CPushPinOnBSF(int nFrames, HRESULT *phr, CSource *pFilter)
// initialize base class
: CSourceStream(NAME("Push Source BitmapSet"), phr, pFilter, L"Out"),
m_FramesWritten(0),
// initialize - no frames written
m_bZeroMemory(0),
// no buffer yet to clear
m_iFrameNumber(0),
// no frames read
m_rtDelta(FPS_25),
// PAL frame rate 25 frames per second
m_bFilesLoaded(FALSE){   // no files loaded
.. // Use the constructor to load the first bitmap pair.
LoadNextBitmap();
m_bFilesLoaded=TRUE;
nFilesLoaded++;
}
CPushPinOnBSF::~CPushPinOnBSF(){ // filter's pin destructor
// close any open files - free memory etc ...
}
void CPushPinOnBSF::LoadNextBitmap(void){
.. // load next bitmap in sequence
m_bFilesLoaded=TRUE;
}
Listing 16.22. Pushing the bitmap pixel data into the AVI data stream for further filtering, compressing and ultimately archiving to disk.
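The pushing itself is done in the pin's FillBuffer() override, which DirectShow calls whenever the downstream filters are ready for another sample. The sketch below is not the book's Listing 16.22: m_cSharedState, m_pBitmapPixels, m_cbBitmap and m_nFrames are assumed member names, while m_iFrameNumber, m_rtDelta and LoadNextBitmap() come from the class shown earlier.

HRESULT CPushPinOnBSF::FillBuffer(IMediaSample *pSample)
{
    BYTE *pData;
    CAutoLock cAutoLock(&m_cSharedState);              // protect the frame counters
    if( m_iFrameNumber >= m_nFrames ) return S_FALSE;  // no more bitmaps: end the stream
    pSample->GetPointer(&pData);
    // copy the current left-over-right bitmap pair (loaded by LoadNextBitmap())
    memcpy(pData, m_pBitmapPixels, m_cbBitmap);
    // time-stamp the sample so the movie plays at the rate implied by m_rtDelta
    REFERENCE_TIME rtStart = m_iFrameNumber * m_rtDelta;
    REFERENCE_TIME rtStop  = rtStart + m_rtDelta;
    pSample->SetTime(&rtStart, &rtStop);
    pSample->SetSyncPoint(TRUE);
    m_iFrameNumber++;
    LoadNextBitmap();                                  // get the next pair ready
    return S_OK;
}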
16.8
Summary
In this chapter, we have looked at the design and coding of a few Windows
application programs for stereopsis. There are many other stereoscopic applications we could have considered. For example, we could add a soundtrack
to the stereo movie or make a stereoscopic webcam [1]. Nevertheless, the
programs we have explored here cover a significant range of components that,
if assembled in other ways, should provide a lot of what one might want to do
for one's own VR system.
Please do not forget that in the printed listings throughout this chapter, we have omitted statements (such as error checking) which, while vitally
important, might have obfuscated the key structures we wanted to highlight.
Bibliography
[1] K. McMenemy and S. Ferguson. Real-Time Stereoscopic Video Streaming. Dr. Dobb's Journal 382 (2006): 18–22.
[2] T. Riemersma. The FLIC File Format. http://www.compuphase.com/flic.htm, 2006.
[3] StereoGraphics Corporation. Stereo3D Hardware Developer's Handbook. http://www.reald-corporate.com/scientific/developer_tools.asp, 2001.
[4] J. Siragusa et al. General Purpose Stereoscopic Data Descriptor. http://www.vrex.com/developer/sterdesc.pdf, 1997.
17
Programming Input and Force Feedback
So far in Part II, everything that we have discussed has essentially been associated with the sense of sight. But, whilst it is probably true that the visual
element is the most important component of any VR system, without the
ability to interact with the virtual elements, the experience and sense of reality are diminished. We have seen in Section 4.2 that input to a VR system
can come from a whole range of devices, not just the keyboard or the mouse.
These nonstandard input devices are now indispensable components of any
VR system. However, even when we equip our VR systems with devices such as
joysticks, two-handed game consoles or custom hardware such as automobile
steering consoles, we don't come close to simulating real-world interaction.
To be really realistic, our devices need to kick back; they need to resist when
we push them and ideally stimulate our sense of touch and feeling. Again,
looking back at Section 4.2, we saw that haptic devices can provide variable
resistance when you try to push something and force feedback when it tries
to push you, thus mediating our sense of touch within the VR.
In this chapter, we will look at how the concepts of nonstandard input,
custom input and haptics can be driven in practice from a VR application
program. Unfortunately for VR system designers on a budget, it is difficult to
obtain any devices that, for touch and feel, come close to providing the
equivalent of the fabulously realistic visuals that even a modestly priced graphics
adapter can deliver. But in Section 17.5, we will offer some ideas as to how
you can, with a modest degree of electronic circuit design skills, start some
custom interactive experiments of your own.
From the software developer's perspective, there are four strategies open
to us so that we can use other forms of input apart from the basic keyboard
and mouse. These are:
1. DirectX, which was previously introduced in the context of real-time
3D graphics and interactive multimedia. Another component of the
DirectX system, DirectInput, has been designed to meet the need that
we seek to address in this chapter. We will look at how it works in
Section 17.1.
2. Use a proprietary software development kit provided by the manufacturer of consoles and haptic devices. We shall comment specifically on
one of these in Section 17.3.1 because it attempts to do for haptics what
OpenGL has done for graphics.
3. There is a class of software called middleware that delivers system independence by acting as a link between the hardware devices and the
application software. The middleware software often follows a client-server model in which a server program communicates with the input
or sensing hardware, formats the data into a device-independent structure and sends it to a client program. The client program receives the
data, possibly from several servers, and presents it to the VR application, again in a device-independent manner. Often the servers run
on independent host computers that are dedicated to data acquisition
from a single device. Communication between client and server is usually done via a local area network. We explore middleware in a little
more detail in Section 17.4.
4. Design your own custom interface and build your own hardware. This
is not as crazy as it seems, because haptics and force feedback are still
the subject of active R&D and there are no absolutes or universally
agreed standards. We shall look at an outline of how you might do this
and provide a simple but versatile software interface for custom input
devices in Section 17.5.
We begin this chapter by describing the most familiar way (on a Windows
PC) in which to program multi-faceted input and force feedback devices; that
is, using DirectInput.
17.1
DirectInput
DirectInput is part of DirectX and it has been stable since the release of DirectX 8. It uses the same programming paradigm as the other components of
DirectX; that is:
Methods of COM interfaces provide the connection between
application programs and the hardware drivers. An application
program enumerates the attached input devices and selects the
ones it wants to use. It determines the capabilities of the devices. It configures the devices by sending them parameters, and
it reads the current state of a device (such as joystick button
presses or handle orientation) by again calling an appropriate interface method.
DirectInput not only provides comprehensive interfaces for acquiring data
from most types of input devices, it also has interfaces and methods for sending feedback information to the devices, to wrestle the user to the ground or
give him the sensation of stroking the cat. We will give an example of using
DirectInput to send force feedback in Section 17.2, but first we start with an
example program to illustrate how two joystick devices might be used in a VR
application requiring two-handed input. Since DirectInput supports a wide
array of input devices, one could easily adapt the example to work with two
mice or two steering-wheel consoles.
DirectInput follows the philosophy of running the data acquisition process in a separate thread. Therefore, an application program is able to obtain
the state and settings of any input device, for example the mouse's position,
by polling for them through a COM interface method.
In the remainder of this section, we will develop a small collection of
functions that can be used by any application program to enable it to acquire
(x, y, z) coordinate input from one, two or more joystick devices. The example will offer a set of global variables to hold input device data, and three
functions to:
1. initialize DirectInput and find joysticks with InitDirectInput(..);
2. release the DirectInput COM interfaces with FreeDirectInput(..);
3. update the global variables by polling for the input device's position
and button state with UpdateInputState(..).
All the DirectInput code will be placed in a separate file for convenience.
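The interface of that file might look something like the following sketch; the global variable names and the exact parameters are illustrative choices rather than fixed by the text:

#define DIRECTINPUT_VERSION 0x0800
#include <dinput.h>

#define MAX_STICKS 2

// global state refreshed by UpdateInputState()
extern long g_JoyX[MAX_STICKS], g_JoyY[MAX_STICKS], g_JoyZ[MAX_STICKS];
extern BYTE g_JoyButtons[MAX_STICKS][32];

HRESULT InitDirectInput(HWND hWnd);    // create DirectInput8 and select up to two joysticks
VOID    FreeDirectInput(void);         // unacquire the devices and release the COM interfaces
HRESULT UpdateInputState(void);        // poll the devices and refresh the globals above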
A minimal joystick device will have two position movements, left to right
(along an x-axis) and front to back (along a y-axis), along with a number of
buttons. A different device may have a twist grip or slider to represent a z-axis
rotation. It is the function of the DirectInput initialization stage to determine
the capability of, and number of, installed joystick-like devices. Other devices,
mice for example or mice with force feedback, should also be identified and
configured. It is one of the tasks of a well-designed DirectInput application
program to be able to adapt to different levels of hardware capability.
17.1.1
Since DirectInput uses COM to access its API, a DirectInput device object
is created in a similar way to the Direct3D device. In the function
InitDirectInput(..), the main task is to enumerate all joystick devices
and pick the first two for use by the application. Device enumeration is standard practice in DirectX. It allows an application program to find out the
capabilities of peripheral devices. In the case of joystick devices, enumeration is also used to provide information about the axes that it supports; for
example, moving the stick to the left and right may be returned as an x-axis.
Listing 17.1 sets up the framework for device initialization and enumeration,
and Listing 17.2 tells DirectInput the name of the callback functions which
permit the application to select the most appropriate joystick, or in this case
the first two found.
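The enumeration revolves around a callback of the following shape; this is a sketch along the lines of the standard DirectInput joystick sample, and the global variable names are illustrative:

LPDIRECTINPUT8       g_pDI = NULL;
LPDIRECTINPUTDEVICE8 g_pJoystick[2] = { NULL, NULL };
int                  g_nJoysticks = 0;

// called once by DirectInput for every attached game controller
BOOL CALLBACK EnumJoysticksCallback(const DIDEVICEINSTANCE *pdidInstance, VOID *pContext)
{
    if( g_nJoysticks >= 2 ) return DIENUM_STOP;              // we only want the first two
    if( FAILED( g_pDI->CreateDevice(pdidInstance->guidInstance,
                                    &g_pJoystick[g_nJoysticks], NULL) ) )
        return DIENUM_CONTINUE;                              // try the next device
    g_nJoysticks++;
    return DIENUM_CONTINUE;
}

// in InitDirectInput(): enumerate every attached game controller
// hr = g_pDI->EnumDevices(DI8DEVCLASS_GAMECTRL, EnumJoysticksCallback,
//                         NULL, DIEDFL_ATTACHEDONLY);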
During axis enumeration, the callback function takes the opportunity
to set the range and dead zone. If, for example, we set the x-axis range to
Listing 17.1. The DirectInput COM interface is created and all attached joysticks are enumerated so that the first two can be selected.
[-1000, 1000] then when our program asks for the current x-coordinate,
DirectInput will report a value of -1000 if it is fully pushed over to the left.
A dead zone value of 100 tells DirectX to report as zero any axis position determined to be less than 100. In theory, a dead zone should be unnecessary,
but in practice the imperfections in cheap hardware can mean that the spring
Listing 17.2. For joystick devices, the number of available device axes has to be deter-
mined and configured so that values returned to the program when movement occurs
along an axis fall within a certain range and have known properties.
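A sketch of the kind of axis-enumeration callback Listing 17.2 describes is given below. Note that the DIPROP_DEADZONE property is specified in units of 0.01% of the axis range, so the value 1000 used here corresponds to 10% of the range, i.e., readings of magnitude less than 100 on our ±1000 scale are reported as zero.

static BOOL CALLBACK EnumAxesCallback(const DIDEVICEOBJECTINSTANCE *pObj, VOID *pCtx)
{
    LPDIRECTINPUTDEVICE8 pDev = (LPDIRECTINPUTDEVICE8)pCtx;

    DIPROPRANGE range;                       /* output range for this axis          */
    range.diph.dwSize       = sizeof(DIPROPRANGE);
    range.diph.dwHeaderSize = sizeof(DIPROPHEADER);
    range.diph.dwHow        = DIPH_BYID;
    range.diph.dwObj        = pObj->dwType;  /* identifies the axis being enumerated */
    range.lMin              = -1000;
    range.lMax              = +1000;
    if (FAILED(pDev->SetProperty(DIPROP_RANGE, &range.diph)))
        return DIENUM_STOP;

    DIPROPDWORD dead;                        /* dead zone, in 0.01% units of range  */
    dead.diph.dwSize        = sizeof(DIPROPDWORD);
    dead.diph.dwHeaderSize  = sizeof(DIPROPHEADER);
    dead.diph.dwHow         = DIPH_BYID;
    dead.diph.dwObj         = pObj->dwType;
    dead.dwData             = 1000;          /* 10% of range: |value| < 100 reads 0 */
    pDev->SetProperty(DIPROP_DEADZONE, &dead.diph);

    return DIENUM_CONTINUE;
}

/* Typically called after CreateDevice(..) and SetDataFormat(&c_dfDIJoystick2):     */
/*   pDev->EnumObjects(EnumAxesCallback, (VOID*)pDev, DIDFT_AXIS);                  */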
17.1.2 Acquiring Data
Listing 17.3. Function to obtain the current axis positions and state of buttons pressed.
Before we can obtain any information about a joystick, it must be acquired by the
program. Otherwise, there could be a conflict with another application also wishing
to use a joystick device.
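Continuing with the illustrative names used in the sketches above, the polling function might look like this (the book's own version is Listing 17.3 on the CD):

HRESULT UpdateInputState(void)
{
    if (g_pJoy1 == NULL) return S_FALSE;

    DIJOYSTATE2 js;                                   /* snapshot of the device state */
    HRESULT hr = g_pJoy1->Poll();
    if (FAILED(hr)) {
        /* The device may not be acquired yet (or was lost); try to reacquire it.    */
        hr = g_pJoy1->Acquire();
        while (hr == DIERR_INPUTLOST) hr = g_pJoy1->Acquire();
        return hr;                                    /* fresh data next time round   */
    }
    if (FAILED(g_pJoy1->GetDeviceState(sizeof(DIJOYSTATE2), &js)))
        return E_FAIL;

    g_xPos1 = js.lX;  g_yPos1 = js.lY;  g_zPos1 = js.lRz;     /* twist grip as z      */
    for (int i = 0; i < 32; i++) g_buttons1[i] = js.rgbButtons[i];
    return S_OK;                                      /* repeat for the second stick  */
}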
17.2
Normally, one associates force feedback with highly specialized and expensive
equipment, but it can be achieved in even modestly priced joystick devices because they are targeted at the game market and are in mass production. These
devices cannot deliver the same accurate haptic response as the professional
devices we discussed in Chapter 4, but they are able to provide a reasonable
degree of user satisfaction. Since DirectInput was designed with the computer game developer in mind, it offers a useful set of functions for controlling force-feedback devices. In terms of the haptics and forces we discussed in
Section 4.2, DirectInput can support constant forces, frictional forces, ramp
forces and periodic forces.
Listing 17.4. To find a joystick with force feedback, a device enumeration procedure
similar to that for non-force-feedback devices is performed. The source code for
CreateForceFeedbackEffects() is outlined in Listing 17.5.
Listing 17.5. Create a constant force effect that can be played out to a joystick device.
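The listing itself is on the CD; as a rough sketch of the kind of DirectInput calls it involves (the names and magnitudes here are illustrative, and the device is assumed to have been created with an exclusive cooperative level), a constant force effect can be created and started like this:

HRESULT CreateAndPlayConstantForce(LPDIRECTINPUTDEVICE8 pDev, LPDIRECTINPUTEFFECT *ppEffect)
{
    DWORD           axes[2] = { DIJOFS_X, DIJOFS_Y };
    LONG            dir[2]  = { 0, 0 };                 /* direction set later        */
    DICONSTANTFORCE cf      = { DI_FFNOMINALMAX / 2 };  /* half of full-scale force   */

    DIEFFECT eff;
    ZeroMemory(&eff, sizeof(eff));
    eff.dwSize                = sizeof(DIEFFECT);
    eff.dwFlags               = DIEFF_CARTESIAN | DIEFF_OBJECTOFFSETS;
    eff.dwDuration            = INFINITE;               /* play until stopped         */
    eff.dwGain                = DI_FFNOMINALMAX;
    eff.dwTriggerButton       = DIEB_NOTRIGGER;
    eff.cAxes                 = 2;
    eff.rgdwAxes              = axes;
    eff.rglDirection          = dir;
    eff.cbTypeSpecificParams  = sizeof(DICONSTANTFORCE);
    eff.lpvTypeSpecificParams = &cf;

    HRESULT hr = pDev->CreateEffect(GUID_ConstantForce, &eff, ppEffect, NULL);
    if (FAILED(hr)) return hr;
    return (*ppEffect)->Start(1, 0);                     /* play one iteration         */
}

Later, changing the magnitude or direction of the force is done by updating the DIEFFECT structure and calling SetParameters(..) on the effect interface.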
static INT g_nXForce;
static INT g_nYForce;
static INT g_mForce;
static INT nForce=0,nForce1=0,nForce2=0;
Listing 17.6. Configuring and playing different types of forces in response to a simple user action.
To feel a 3D object, pick it up and move it around, one needs to be able to sense
the 3D position of the point of contact with the object and apply reactionary
forces in any direction. Chapter 4 described a few such devices, and we will
now briefly look at the types of programming interfaces that have evolved to
drive them.
17.3
Haptic devices with multiple degrees of freedom and working under different operating conditions (desktop devices, or wearable devices, or custom
devices) do not yet conform to a well-documented and universally agreed set
of standard behaviors and software interface protocols. The things they do
have in common are that they need to be able to operate in real time, move in
three-dimensional space, apply forces in any direction and provide accurate
feedback of their 3D position and orientation. The term haptic rendering (or
force rendering) is used to describe the presentation of a scene that can be felt
rather than one that can be seen. Haptics involves an additional complexity in
that not only should the device present its user with a scene that she can feel,
but it should also be able to allow its user to push, poke, prod and sculpt that
scene. To put it another way, the device should not just oppose the movement
of the user with a force to simulate a hard surface. It should be able to adapt
to the movement of the user so as to simulate deformable or elastic material,
or indeed go so far as to facilitate dynamics, friction and inertia as the user
pushes a virtual object around the virtual world, lifts it or senses its weight.
Imagine a virtual weight lifting contest.
We would not expect all these things to be doable by the haptic device
hardware on its own, but in the future, just as today's 3D graphics cards can
deliver stunningly realistic images to fool our eyes, we hope that a similar device will be available to fool our senses of touch and feel. In the meantime, the
software that controls the haptic devices has a crucial role to play. Our software design goal should be programs that can, as far as possible, emulate the
ideal behavior of a haptic device to completely facilitate our sense of presence.
This is not a simple thing to achieve. As well as driving the haptic device, the
application program has to:
Store a description of the 3D scene in terms of surfaces that can be
touched and felt.
Calculate collisions between the virtual position of the end effector (the
avatar) of the haptic device and the objects' surfaces.
http://www.sensable.com/; http://www.sensegraphics.com/.
that the object is not as stiff as one would wish. More seriously, however, it
can cause instability in the operation of the device. We briefly discussed this
problem in Section 4.2.1. In practice, the haptic APIs use multi-threading,
because it allows them to drive the servo motors from a high-priority thread
that performs very little work other than to keep the servos correctly driven
and report back the 3D location of the device's end effector within its working
volume. The Ghost and OpenHaptics systems run what they call a servo loop
at a rate of 1 kHz.
Some of the multi-threaded tasks a haptics application has to do might be
hidden within a device driver in much the same way that DirectInput handles
the hardware. However, a typical haptic application may need to run up to
four concurrent threads:
1. A thread to render the scene visually. This runs at 30 Hz.
2. A fast force response and basic collision detection thread. This thread
drives the servos and does the most simple of collision detection with a
few bounding volumes such as spheres and planes. It typically runs at
rates of between 300 Hz and 1 kHz.
3. A collision detection thread which calculates collisions between those
parts of an object that are close to the avatar. This will be at a more
detailed level than the collision detection in thread 2.
4. And finally, the simulation thread which performs such things as modeling surface properties, stiffness, elasticity, push and pull etc.
17.3.1 OpenHaptics
The OpenHaptics [16] system library from SensAble provides two software
interfaces for Phantom devices. These are called the Haptics Device
API (HDAPI) and Haptic Library API (HLAPI). The HDAPI is a low-level
library which allows an application to initialize the haptic device, start a servo
loop thread and perform some basic haptic commands, such as get the position of the end effector p, calculate a force as a function of p (for example, a
coulombic force field) and pass that force to the servo loop. The HDAPI can
execute synchronous calls, useful for querying the state of user operable push
buttons, or asynchronous calls, such as rendering the coulombic force field.
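As a flavour of the HDAPI style, here is a minimal sketch of a servo-loop callback that reads the end-effector position p and commands a simple spring force F = −kp (a stand-in for the coulombic field mentioned above). It is based on our reading of the OpenHaptics documentation; consult the HDAPI reference [16] for the exact calling conventions.

#include <HD/hd.h>    /* OpenHaptics HDAPI */

HDCallbackCode HDCALLBACK SpringForceCallback(void *pUserData)
{
    const HDdouble k = 0.25;                       /* illustrative stiffness        */
    HDdouble p[3], f[3];
    hdBeginFrame(hdGetCurrentDevice());
    hdGetDoublev(HD_CURRENT_POSITION, p);          /* end-effector position (mm)    */
    for (int i = 0; i < 3; i++) f[i] = -k * p[i];  /* F = -k p                      */
    hdSetDoublev(HD_CURRENT_FORCE, f);             /* hand the force to the servo   */
    hdEndFrame(hdGetCurrentDevice());
    return HD_CALLBACK_CONTINUE;                   /* keep running at servo rate    */
}

int StartHaptics(void)
{
    hdInitDevice(HD_DEFAULT_DEVICE);               /* open the default Phantom      */
    if (HD_DEVICE_ERROR(hdGetError())) return -1;
    hdEnable(HD_FORCE_OUTPUT);                     /* allow forces to be output     */
    hdStartScheduler();                            /* start the 1 kHz servo thread  */
    hdScheduleAsynchronous(SpringForceCallback, NULL, HD_DEFAULT_SCHEDULER_PRIORITY);
    return 0;
}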
The HLAPI follows the OpenGL syntax and state machine model. It
allows the shapes of haptic surfaces to be described by points, lines and triangular polygons, as well as to set material properties such as stiffness and friction.
It offers the possibility of taking advantage of the feedback, depth rendering
and the simultaneous 3D presentation which OpenGL offers. Indeed, there is
such a close connection between being able to see something and touch something that the analogy with OpenGL could open the door to a much wider
application of true haptic behavior in computer games, engineering design
applications, or even artistic and sculpting programs.
OpenHaptics executes two threads, a servo thread at 1 kHz and a local collision detection thread which makes up simple local approximations
to those shapes that lie close to the avatar. The collision detection thread
passes this information to the servo thread to update the force delivered by
the haptic device. OpenHaptics carries out proxy rendering as illustrated
in Figure 17.1. The proxy follows the position of the haptic device, but it is
constrained to lie outside all touchable surfaces. The haptic device, because of
servo loop delays, can actually penetrate objects. The proxy will iterate from
its previous position to a point on the surface closest to the haptic device's
current position. (When the haptic device lies outside a touchable object,
its position and the proxy position will coincide.) When the haptic device
lies inside a touchable object, a restoring force using the spring damper force
model will try to pull the haptic device back to the surface point where the
proxy lies.
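A minimal sketch of that restoring-force calculation is shown below: a spring pulls the haptic device towards the proxy, and a damping term removes energy. The stiffness and damping values are purely illustrative.

void ProxyForce(const double proxy[3], const double device[3],
                const double velocity[3], double force[3])
{
    const double k = 0.5;    /* spring stiffness (force per unit of penetration)   */
    const double b = 0.05;   /* damping coefficient                                */
    for (int i = 0; i < 3; i++)
        force[i] = k * (proxy[i] - device[i]) - b * velocity[i];
    /* When the device is outside every touchable surface, proxy == device and     */
    /* the force reduces to a small damping term (zero when the device is at rest).*/
}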
Figure 17.2. A flow diagram of a typical haptic program that uses the HLAPI. The
process events block handles messages from the haptic device and will react to an exit
message. The OpenGL render stage also renders a Z-depth buffer view from the
avatar which may be used for collision detection.
Although the commercially available desktop haptic devices and force-feedback gloves are reducing in cost and becoming more available, you may still wish to design, build and use your own custom input devices. There are two possible approaches: first, a large body of knowledge from the open-source community can be tapped into through the use of middleware; second, you may wish to work directly with a Windows PC. In the next two sections, we will explore the middleware concept and outline how a few simple electronic components can form an interface to the Windows PC through its universal serial bus (USB). In particular, we'll see how the USB can be accessed from an application program written in C without the need to write a device driver.
17.4 Middleware
Figure 17.3. By using a middleware program library, the VR system can be distributed
over a number of hosts. Each input device is controlled by its own dedicated controller. The input data is broadcast over a local area network and used as required by
the VR software application.
One important advantage of the middleware approach is that it allows the input devices to be physically remote from the VR application that uses the data. This effectively allows a VR system to operate in a distributed manner, in which input and display devices are connected to different hosts. Nevertheless, to the user, the system still appears to act as a single, tightly integrated entity. Figure 17.3 illustrates how this might work in practice.
Two mature, flexible and open source middleware tools that are easily
built and extended are the Virtual-Reality Peripheral Network (VRPN) [20]
and OpenTracker [9]. They both provide a degree of device abstraction and
can be combined with a scene graph (see Section 5.2) visualization package
to deliver powerful VR applications with only modest effort on the part of
the developer. A helpful aspect of the architecture in these libraries is their
use of the client-server model. This allows an individual computer to be
dedicated to the task of data acquisition from a single device. Thus, the VR
program itself receives its inputs in a device-independent fashion via a local
area network (LAN). We will briefly offer comment on these two helpful
pieces of software.
17.4.1 VRPN
The Virtual-Reality Peripheral Network system (see Taylor et al. [18]) was
developed to address concerns that arise when trying to make use of many
types of tracking devices and other resource-intensive input hardware. These
concerns included:
ease of physical access to devices, i.e., fewer wires lying around;
very different hardware interfaces on devices that have very similar
functions;
devices may require complex configuration and set-up, so it is easier to
leave them operating continuously; and
device drivers may only be available on certain operating systems.
VRPN follows the client-server model in Figure 17.3 and has functions
that can be built into the server program to manage the data acquisition from
a large selection of popular hardware (see Section 4.3). For example, an Ascension Technology Flock of Birds motion tracker can be connected to a
VRPN server running under Linux via the /dev/tty0 serial device. The
VRPN server will multicast its data onto an Ethernet LAN which can then
be picked up on a Windows PC running a VRPN client.
A good example application that demonstrates all these features is distributed with the OpenSceneGraph library [10]. The code for OpenSceneGraph's osgVRPNviewer project shows how to link directly to a VRPN server
that reads its data from a spaceball device executing on the local host. Since
VRPN and OpenSceneGraph are open-source applications, we suggest you
download both of them and experiment with your own tracking and input
devices.
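To give a feel for how little client code is needed, here is a minimal VRPN client sketch that connects to a tracker served under the (assumed) name Tracker0 on the server 192.168.1.1 and prints position reports as they arrive:

#include <cstdio>
#include <vrpn_Tracker.h>

static void VRPN_CALLBACK HandleTracker(void *, const vrpn_TRACKERCB t)
{
    /* Called from mainloop() whenever the server sends a new position report. */
    printf("sensor %d at (%f, %f, %f)\n", t.sensor, t.pos[0], t.pos[1], t.pos[2]);
}

int main(void)
{
    vrpn_Tracker_Remote tracker("Tracker0@192.168.1.1");
    tracker.register_change_handler(NULL, HandleTracker);
    while (true)
        tracker.mainloop();     /* pump the connection; callbacks fire from here */
    return 0;
}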
17.4.2 OpenTracker
17.5
17.5.1
The USB interface uses a serial protocol, and the hardware connections consist of four wires: +5V power and ground and two data lines called D+
and D−. The interface, depending on how it is configured, can be low,
full or high speed. Low-speed transfers operate at 1.5 Mbits/s, full-speed at
12 Mbits/s and high-speed 480 Mbits/s. However, in addition to data, the
bus has to carry control, status and error-checking signals, thus making the
actual data rates lower than these values. Therefore, the maximum data rates
are only 800 bytes per second for low speed, 1.2 megabytes per second for full speed and 53 megabytes per second for high speed. The USB Implementers Forum3 provides a comprehensive set of documentation on USB and has many
links to other useful resources.
There are four types of data transfers possible, each suited to different
applications, which may or may not be available to low-speed devices. These
transfers are control, bulk, interrupt and isochronous. Devices such as keyboards, mice and joysticks are low-speed devices, and use interrupt data transfer. The maximum possible transfer of data for this combination is 8 bytes
per 10 milliseconds.
When a peripheral device is connected to a PC via USB, a process called
enumeration takes place. During enumeration, the device's interface must send several pieces of information called descriptors to the PC. The descriptors are sets of data that completely describe the USB device's capabilities and
how the device will be used. A class of USB devices falls into a category
called human interface devices (HIDs), not surprisingly because these devices
interact directly with people. Examples of HIDs are keyboards, mice and
joysticks. Using the HID protocol, an application program can detect when
someone moves a joystick, or the PC might send a force-feedback effect that
the user can feel.
3
http://www.usb.org/.
With many custom-made USB devices, the user has to create a driver for
Windows. However, Windows has supported the HID class of USB devices
since Windows 98. Therefore, when a USB HID is plugged into a PC running Windows, it will build a custom driver for the device. As a result, if we
can construct a simple device that fools Windows into believing it is dealing
with an HID, there will be no need to write custom device drivers and we can
use DirectInput or write our own high-level language interface in only a few
lines of code.
17.5.2 USB Interfacing
Figure 17.4. A USB to analog voltage interface using a PIC. On the left is the circuit
diagram of the device which converts the analog voltage on five potentiometers and
one thumb wheel switch into a USB compatible data frame of nine bytes delivered to
the PC at a rate of 1K bytes/s. No external power supply is necessary and the voltage
scale of [0, 5V ] is converted to an unsigned byte value ([0, 255]).
speed, the places to which you can take your virtual world are only restricted
by your imagination and the ability to construct your human interface device
design. Apart from soldering a few electronic components, perhaps the hardest task is to write, develop and test the firmware for the PIC. A good development system, such as the Proton+ Compiler [11], is essential; some practical guidance may also be found in the PIC microcontroller project book [4].
At the core of the microcontroller code must lie routines that allow its
USB interface to be enumerated correctly by Windows when it connects to
the PC. This is done in a series of stages by passing back data structures
called descriptors. Mack [5] gives a full program listing for PIC microcode
that enumerates itself as a joystick to Windows. During enumeration, the
PC requests descriptors concerning increasingly small elements of the USB
device. The descriptors that have to be provided for a HID device so that it
appears to Windows as a joystick are:
The device descriptor has 14 fields occupying 18 bytes. The fields tell
the PC which type of USB interface it provides, e.g., version 1.00.
If a USB product is to be marketed commercially, a vendor ID (VID)
must be allocated by the USB Implementers Forum. For hardware
that will never be released commercially, a dummy value can be used.
For example, if we assign 0x03E8 as the VID then an application
program may look for it in order to detect our device (see Listing 17.7).
#include "setupapi.h"
#include "hidsdi.h"
Listing 17.7. Functions to initialize the USB device, acquire file handles to be used
for reading data from the device, and start the thread functions that carry out the
reading operation.
The configuration descriptor has 8 fields occupying 9 bytes. This provides information on such things as whether the device is bus-powered.
If the PC determines that the requested current is not available, it will
not allow the device to be configured.
The interface descriptor has 9 fields occupying 9 bytes. It tells the host
that this device is an HID and to expect additional descriptors to provide details of the HID and how it presents its reports. (For example,
in the case of a joystick, the report describes how the nine-byte data
frame is formatted to represent the x-, y-, z-coordinates, button presses
etc.)
Field               Value   Attribute   Description
Usage page          0x05    0x01        Generic Desktop
Usage               0x09    0x04        Joystick
Collection          0xA1    0x01        Application
Usage page          0x05    0x02        Simulation controls
Usage               0x09    0xBB        Throttle
Logical Minimum     0x15    0x00        0
Logical Maximum     0x26    0x00FF      255
Physical Minimum    0x35    0x00        0
Physical Maximum    0x46    0x00FF      255
Report size         0x75    0x08        Number of bits = 8
Report count        0x95    0x01        Number of above usages = 1
Input               0x81    0x02        Data, variable, absolute
Usage page          0x05    0x01        Generic Desktop
Usage               0x09    0x01        Pointer
Collection          0xA1    0x00        Physical
Usage               0x09    0x30        x-axis
Usage               0x09    0x31        y-axis
Usage               0x09    0x32        z-axis
Usage               0x09    0x36        Slider
Usage               0x09    0x35        RZ
Report count        0x95    0x05        Number of above usages = 5
Input               0x81    0x02        Data, variable, absolute
End collection      0xC0
End collection      0xC0
Table 17.1. The report descriptor of a USB HID that is interpreted as emanating
from a three-axis joystick. The value parameter tells the enumerating host what is
being defined and the Attribute parameter tells the host what it is being defined as.
For example, a descriptor field value of 0x09 tells the host that a joystick axis is being
defined and an attribute of 0x30 says it is an x-axis.
Figure 17.5. A USB device configured with the report descriptor given in Table 17.1 will appear to Windows as a joystick device and can be calibrated as such and used by DirectInput application programs in the same way that a joystick would be.
17.5.3
threads do for controlling force rendering. How the device is initialized and
data acquisition threads are started is shown in Listing 17.7. The application
program searches for two devices with appropriate product IDs (these match
the ones set in the device descriptors as part of the PIC firmware). Function FindUSBHID(..) searches all the installed HIDs looking for the desired
vendor and product ID. Function FindHID(int index) (see the full code
listings on the CD for the source code of this function) calls into the Windows
DDK to obtain a file handle for the first or second HID device installed on
extern float x_rjoy_pos;
extern float y_rjoy_pos;
extern float z_rjoy_pos;
extern BOOL  bLoop;
Listing 17.8. The data acquisition function. The USB devices appear as conventional Windows files, so a ReadFile(h1,...) statement will retrieve
up to nine bytes from the latest USB frame obtained by Windows from the device.
Executing the read command four times purges any buffered frames. Doing this
ensures that we get the latest position data.
the computer, as defined by the argument index. Once the last device has been checked, the function returns the code INVALID_HANDLE_VALUE. To get the vendor and product ID, the DDK function HidD_GetAttributes() is called to fill out the fields in a HIDD_ATTRIBUTES structure. The structure
members contain the information we have been looking for.
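A condensed sketch of these two steps, identifying our device by its IDs and reading a frame from it, is given below. The product ID 0x0001 and the assumption that the nine-byte frame begins with the report ID are ours; the full FindHID(..) code is on the CD. The program must be linked with hid.lib and setupapi.lib.

#include <windows.h>
#include <setupapi.h>
extern "C" {
#include <hidsdi.h>
}

BOOL IsOurDevice(HANDLE hDev)
{
    HIDD_ATTRIBUTES attr;
    attr.Size = sizeof(HIDD_ATTRIBUTES);
    if (!HidD_GetAttributes(hDev, &attr)) return FALSE;
    return (attr.VendorID == 0x03E8 && attr.ProductID == 0x0001);
}

BOOL ReadLatestFrame(HANDLE hDev, BYTE frame[9])
{
    DWORD nRead = 0;
    /* Read several times to purge buffered reports, as Listing 17.8 does, so that */
    /* the frame we keep is the most recent one.                                   */
    for (int i = 0; i < 4; i++)
        if (!ReadFile(hDev, frame, 9, &nRead, NULL)) return FALSE;
    return (nRead > 0);
}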
17.6 Summary
In this chapter, we have looked at a varied assortment of programming methods for acquiring human input from sources other than the traditional keyboard and mouse. A sense of touch, two-handed input and force feedback
are all vital elements for VR, whether it be on the desktop or in a large-scale
environmental simulation suite. In the next and final chapter, we shall seek
to bring together the elements of graphics, multimedia and a mix of input
methods in some example projects.
Bibliography
[1] J. Axelson. USB Complete: Everything You Need to Develop Custom USB Peripherals, Third Edition. Madison, WI: Lakeview Research, 2005.
[2] HAL. Haptic Library. http://edm.uhasselt.be/software/hal/.
[3] J. Hyde. USB Design by Example: A Practical Guide to Building I/O Devices,
Second Edition. Portland, OR: Intel Press, 2001.
[4] J. Iovine. PIC Microcontroller Project Book: For PIC Basic and PIC Basic Pro
Compilers, Second Edition. Boston, MA: McGraw-Hill, 2004.
[5] I. Mack. PhD Differentiation Report. School of Electrical Engineering, Queen's University Belfast, 2005. http://www.ee.qub.ac.uk/graphics/reports/mack_report_2005.pdf.
[6] Microchip Technology Inc. DS41124C. http://ww1.microchip.com/
downloads/en/devicedoc/41124c.pdf, 2000.
[7] Microsoft Corporation. DirectX SDK. http://msdn.microsoft.com/directx/
sdk/, 2006.
[8] W. Oney. Programming the Microsoft Windows Driver Model, Second Edition.
Redmond, WA: Microsoft Press, 2002.
[9] OpenTracker. http://studierstube.icg.tu-graz.ac.at/opentracker/, 2006.
Haptic Devices. http://sensable.com/
18
Building on the Basics, Some Projects in VR
In this, the final chapter, we thought it would be useful to show you how
the code and examples in Chapters 13 through 17 can be rearranged and
combined to build some additional utility programs. We will also discuss
how all the elements of desktop VR can be made to work together and the
issues that this raises in terms of program architecture and performance. We
won't be printing code listings in this chapter for two main reasons. First, the
book would be too long, and second, we would only be repeating (more or
less) a lot of the code you have already seen.
All the code, project files, help and readme comments can be found on the
accompanying CD. We have written our programs using Microsoft's Visual Studio 2003 .NET and Visual C++ Version 6. Although Visual Studio 2005 is the latest version of this tool, we chose to use 2003 .NET and VC 6 for our implementation because VS 2005 can easily read its predecessors' project files but the converse is not true.
So, enough said about where to find the code for our projects; let's have
a look at them in detail.
18.1 Video Editing
We start with video editing. If you are compiling some video content for
your VR environment, you will need to be able to mix, match and collate
some disparate sources.
On the face of it, a video-editing application should be a very substantial
project. There are many commercial applications that have had years of work
devoted to their development. If all you want is a good video editing tool
then the code we include for our little project is not for you. However, if
you have followed our examples in Chapter 15 on using DirectShow and you
want to build some editing capability into an application in your VR program
portfolio, you could not do better than use the DirectShow Editing Services
(DES) that are part of DirectShow. Microsoft's Movie Maker program uses
this technology.
18.1.1
Figure 18.1. A DES application program selects a number of input movies and sound
files. It arranges all of them, or sections (clips) from them, in any order and mixes
them together over time. The application can present a preview of the arrangement
and compose an output movie file containing the arrangement and with a specified
resolution and format.
Figure 18.2. DES video editing. Several video clips, each in an AVI file are combined
into a single video stream. The clips are placed on a track and starting times defined.
The final video is produced by superimposing track 3 on track 2 on track 1. In any
gaps between the clips on track 3 the composite of tracks 1 and 2 will appear. A
transition interval can be used to mix or wipe between tracks.
Figures 18.1 and 18.2 illustrate everything we need to know about DES.
A few COM interfaces provide all the editing functionality one could wish
for. DES is based on the concept of the timeline. A DES application program
lays out the video clips along this timeline by providing a time for the clip
to start and a time for it to finish. Clips are concatenated into tracks, and if
there are any time gaps between clips, a blank section is included in the track.
Clips cannot overlap in time within a track. The tracks are mixed together
to form the video composition1 using a layer model and another DES element
called a transition. Video compositions are joined with a parallel structure for
the audio tracks to form the final movie.
The layer-mixing model works in a similar way to the way images are layered
in a photo-editing program. A first track is laid down against the timeline
(gaps between clips appear as blanks), and this becomes the current composition. As the second track is processed, it overwrites the current composition
in those intervals where they overlap in time unless a transition is used to
set a time interval during which the incoming track gradually replaces the
current composition. When this process is complete, we have a new and updated current composition. This process is repeated until all the tracks have
been processed. Figure 18.2 illustrates the mixing process and the DES
structure.
1
DES allows a hierarchical combination of compositions, but this facility is something we don't need in a basic video editor.
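To make the timeline idea concrete, the sketch below builds the smallest possible DES arrangement: one group containing one track containing one clip, previewed through the render engine. It is based on the DES interfaces documented in the DirectX SDK [7]; error checking, BSTR management and the audio group are omitted, and the file name and times are illustrative.

#include <dshow.h>
#include <qedit.h>     /* DirectShow Editing Services interfaces */

void PreviewOneClip()
{
    IAMTimeline *pTL = NULL;
    CoCreateInstance(CLSID_AMTimeline, NULL, CLSCTX_INPROC_SERVER,
                     IID_IAMTimeline, (void**)&pTL);

    /* Group -> track -> source, mirroring Figure 18.2. */
    IAMTimelineObj *pGroupObj = NULL, *pTrackObj = NULL, *pSrcObj = NULL;
    pTL->CreateEmptyNode(&pGroupObj, TIMELINE_MAJOR_TYPE_GROUP);
    pTL->AddGroup(pGroupObj);

    IAMTimelineComp *pComp = NULL;
    pGroupObj->QueryInterface(IID_IAMTimelineComp, (void**)&pComp);
    pTL->CreateEmptyNode(&pTrackObj, TIMELINE_MAJOR_TYPE_TRACK);
    pComp->VTrackInsBefore(pTrackObj, -1);              /* append the track           */

    pTL->CreateEmptyNode(&pSrcObj, TIMELINE_MAJOR_TYPE_SOURCE);
    pSrcObj->SetStartStop(0, 50000000);                 /* 0..5 s on the timeline     */
    IAMTimelineSrc *pSrc = NULL;
    pSrcObj->QueryInterface(IID_IAMTimelineSrc, (void**)&pSrc);
    pSrc->SetMediaName(SysAllocString(L"clip1.avi"));   /* hypothetical clip (leaked) */
    IAMTimelineTrack *pTrack = NULL;
    pTrackObj->QueryInterface(IID_IAMTimelineTrack, (void**)&pTrack);
    pTrack->SrcAdd(pSrcObj);

    /* Let the render engine build a preview FilterGraph from the timeline. */
    IRenderEngine *pRender = NULL;
    CoCreateInstance(CLSID_RenderEngine, NULL, CLSCTX_INPROC_SERVER,
                     IID_IRenderEngine, (void**)&pRender);
    pRender->SetTimelineObject(pTL);
    pRender->ConnectFrontEnd();
    pRender->RenderOutputPins();
    /* ...obtain IMediaControl from pRender->GetFilterGraph(..) and call Run(). */
}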
18.1.2 Recompression
Whilst the DES can be used to change the compression codec used in an AVI file, it is a bit of overkill. It is easier to build a simple DirectShow FilterGraph to do the same thing. All one needs are a source filter, the compression filter of choice and a file-writing filter. The smart rendering capability of the GraphBuilder object will put in place any necessary decoding filters. To select the compression filter, our application creates a selection dialog box populated with the friendly names of all the installed compression filters, which we can determine through enumeration. Then a call to the media control interface's Run() method will run the graph. We have seen all the elements for this project before, with the exception of telling the FilterGraph that it must not drop frames if the compression cannot be accomplished in real time. This is done by disengaging the filter synchronization from the FilterGraph's internal clock with a call to the SetSyncSource(NULL) method on the IMediaFilter interface.
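A skeleton of that FilterGraph, with the chosen compressor passed in as pCompressor and illustrative file names, might look like the following (based on the standard DirectShow interfaces; see [7]):

#include <dshow.h>

void Recompress(IBaseFilter *pCompressor)
{
    IGraphBuilder *pGraph = NULL;
    ICaptureGraphBuilder2 *pBuild = NULL;
    CoCreateInstance(CLSID_FilterGraph, NULL, CLSCTX_INPROC_SERVER,
                     IID_IGraphBuilder, (void**)&pGraph);
    CoCreateInstance(CLSID_CaptureGraphBuilder2, NULL, CLSCTX_INPROC_SERVER,
                     IID_ICaptureGraphBuilder2, (void**)&pBuild);
    pBuild->SetFiltergraph(pGraph);

    IBaseFilter *pSrc = NULL, *pMux = NULL;
    IFileSinkFilter *pSink = NULL;
    pGraph->AddSourceFilter(L"input.avi", L"Source", &pSrc);
    pGraph->AddFilter(pCompressor, L"Compressor");
    pBuild->SetOutputFileName(&MEDIASUBTYPE_Avi, L"output.avi", &pMux, &pSink);
    pBuild->RenderStream(NULL, NULL, pSrc, pCompressor, pMux);  /* smart connect  */

    /* Do not drop frames if compression is slower than real time. */
    IMediaFilter *pMediaFilter = NULL;
    pGraph->QueryInterface(IID_IMediaFilter, (void**)&pMediaFilter);
    pMediaFilter->SetSyncSource(NULL);

    IMediaControl *pControl = NULL;
    pGraph->QueryInterface(IID_IMediaControl, (void**)&pControl);
    pControl->Run();   /* wait for completion via IMediaEvent in a real program   */
}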
18.2
For the most part, we rely on the fact that nearly all graphics adapters have two
video outputs and that the Windows desktop can be configured to span across
the two outputs. Stereo-enabled HMDs, like that illustrated in Figure 10.1,
have separate VGA inputs for left and right eyes, and these can be connected
to the pair of outputs on the video card. To generate the stereoscopic effect,
no special hardware is required; we simply ensure that images and movies are
presented in the left-beside-right format illustrated in Figure 16.4.
Thus, when an image or movie is shown in a full-screen window, its left
side will appear in one of the video outputs and its right side in the other. As
a result, any movie or image display program that scales the output to fill the
desktop will do the job.
18.3 Video Processing
Figure 18.3. The video editing project allows the AVI clip in (a) to be resized, as in
(b) with any aspect ratio, or a section of it cut out and rendered (c), or some image
process applied to it, such as dynamic or fixed rotations (d) and (e).
filter for inclusion in a DES application. The idea behind the application is
to use DirectShow to decode an AVI video file and render it into an OpenGL
texture. We showed how to do this in Section 15.6. This mesh can be animated, rotated, rolled up; almost any morphing effect could be created as
the movie is playing in real time. By capturing the movie as it is rendered in
the onscreen window, a new movie is generated and stored in an AVI file. To
capture the movie from the renderer window, the program uses a FilterGraph
with the same structure we used in Chapter 15 to generate an AVI movie from
a sequence of bitmaps.
Taking advantage of OpenGL's texturing means that the processed movie will be antialiased, and small sections of the movie's frame could also be magnified. In addition, if we call on the GPU fragment processor (through the use of OpenGL 2 functions), we will be able to apply a wide range of computationally intensive imaging processes in real time. Figure 18.3 illustrates a few
features of this project.
The comments in the project code will guide you through the application.
18.4 Chroma-Keying
Figure 18.4. The FilterGraph for the chroma-key project. The video RAM buffer
content is combined with the AVI movie buffer content by chroma-keying on a blue
or green key color in the video buffer.
Figure 18.5. Chroma-key in action. At the top left is a frame from a movie. The
dialog below sets the color thresholds. At (a), the camera image shows a desktop (a
yellow ball sits in front of a screen showing a green color). When the chroma-key
is turned on, at (b), parts of the green screen that are visible are replaced with the
frames from the AVI movie.
18.5
Figure 18.6. Dividing up an image for display in four pieces. Overlaps are used so that
the projected pieces can be blended. The master machine drives the two projectors
displaying the central area. The slave machine drives the two projectors showing the
extreme left and extreme right image segments.
2
We will do this by setting up one wide desktop, rendering the output to go to one of the
projectors from the left side of the desktop and the output for the other projector from the
right half of the desktop.
and the two peripheral sections are displayed by a slave computer under the
remote control of the master. This arrangement is described in Figure 18.6.
As we have noted before, it is necessary to arrange some overlap between the
images. This complication necessitates sending more than one half of the
image to each of the two pairs of projectors.
We don't have space here to go into a line-by-line discussion of how the
code works. As you might guess, however, we have already encountered most
of the coding concepts we need to use3 in previous chapters. It is now only
a matter of bringing it together and adapting the code we need. By knowing
the overall design of the project software, you should be able to follow the
fine detail from the documentation on the CD and comments in the code.
18.5.1
Since we are going to be driving four projectors, we will need two PCs, each
with a dual-output graphics adapter. The PCs will be connected together
through their Ethernet [2] interfaces. You can join the PCs in back-to-back
fashion or use a switch/hub or any other network route. In our program, we
assume that the PCs are part of a private network and have designated private
IP addresses [5] of 192.168.1.1 and 192.168.1.2. The two machines will
operate in a client-server model with one of them designated the server and
the other the client. On the server machine, a program termed the master will
execute continuously, and on the client machine another program termed the
slave will do the work. It is the job of the master program to load the images
(or movies), display their central portion and serve the peripheral part out to
the slave. It is the job of the slave to acquire the peripheral part of the image
(or movie) from the master and display it.4
Whilst the same principles apply to presenting images, movies or interactive 3D content across the multi-projector panorama, we will first discuss the
display of wide-screen stereoscopic images. The software design is given in
block outline in Figure 18.7. In Section 18.5.3, we will comment on how the
same principles may be used to present panoramic movies and 3D graphics.
3
Actually, we have not discussed Windows sockets and TCP/IP programming before but
will indicate the basic principles in this section.
4
We don't want to complicate the issue here, but on the CD you will find a couple of other
utilities that let us control both the master and the slave program remotely over the Internet.
You will also see that the architecture of the master program is such that it can be configured
to farm out the display to two slave programs on different client machines. It could even be
easily adapted to serve out the display among four, eight or more machines.
Figure 18.7. A block outline of the software design for the master/slave quad-projector
display. The steps labeled 1 to 5 illustrate the sequence of events which occur when an image (or sequence of images) is displayed.
The master starts two communications threads. The first thread will send the part of the image that the slave is to display. The second thread will handle control messages to be sent to the client.
The slave also starts two communications threads. One will retrieve the
image from the master; the other waits for commands from the master
and posts messages into the message loop to execute those commands.
When the master is instructed to display an image, it reads and decodes
the pixel data into a RAM buffer (stereoscopic images are stored in
left-over-right format). This is labeled 1 in Figure 18.7.
The master sends a control signal to the slave to tell it that a new image
is ready to be grabbed (Step 2).
The slave tells the master to dispatch the part of the image it needs
to form the peripheral edges of the display (Step 3). The slave's local
buffer is filled (Step 4).
The master and slave apply the distortion and blending to their respective parts of the image and send the output to their pair of display
projectors (Steps 5a and 5b).
18.5.2
Distortion and blending are configured manually using a test image and grid.
The parts of the image, as shown in Figure 18.6, are rendered as a texture
on a quadrilateral mesh. By interactively moving the corners of the mesh
and controlling the bending along the edges, the shape of the projection is
altered to facilitate some overlap between adjacent projections and correct for
a nonflat screen. The application program and these features are partially
shown in action in Figure 4.11. The system only has to be configured once.
Parameters to control blending and overlap between adjacent projections are
recorded in a file. This data is used to automatically configure the projections
each time the programs start.
Displaying images and movies as objects painted with an OpenGL texture
works very well. It may be used for a stereoscopic display and can easily cope
with the refresh rates required for video output. Rendering the output of a
3D graphics program requires an additional step. Since we must render an
OpenGL-textured mesh to form the output, we cannot render a 3D scene
into the same output buffers. Thus, we do not render the scene into either
the front or back frame buffers; instead, we render it into an auxiliary buffer.
(Most graphics adapters that support OpenGL have at least two auxiliary
buffers.) The content of the auxiliary buffer is then used as the source of
a texture that will be painted onto the mesh object. This may involve an
extra step, but it does not usually slow down the application, because none of
these steps involve the host processor. Everything happens within the graphics
adapter.
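A sketch of that auxiliary-buffer round trip is shown below; texID is assumed to name an OpenGL texture that has already been allocated at the size of the viewport.

#include <GL/gl.h>

void RenderDistortedView(GLuint texID, int width, int height)
{
    glDrawBuffer(GL_AUX0);            /* render the 3D scene off screen            */
    /* ... draw the 3D scene here ... */

    glBindTexture(GL_TEXTURE_2D, texID);
    glReadBuffer(GL_AUX0);            /* copy the auxiliary buffer into the texture */
    glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, width, height);

    glDrawBuffer(GL_BACK);            /* now draw the warped, blended mesh on screen */
    glEnable(GL_TEXTURE_2D);
    /* ... draw the quadrilateral mesh with its texture coordinates here ... */
}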
18.5.3 Synchronization
Synchronization, command passing and data distribution between the master and slave programs are accomplished using TCP/IP5 over Ethernet. Our
application uses the Windows sockets6 programming interface, and because
it is multi-threaded, we don't have to worry too much about the blocking action that used to beset Windows sockets programming. The master and slave
programs open listening sockets on high-numbered ports and wait for data or
commands to be sent from the other side.
We use our own simple protocol for sending commands by encoding the
action into an eight-byte chunk; for example, to initiate a data transfer. The
slave's control thread waits in an infinite loop using the recv(...) function,
which only returns once it gets eight bytes of data from the server which were
sent using the send(..) function. The slave acts upon the command by
using PostMessage(...) to place a standard Windows WM_COMMAND message into the message loop. When the slave's control thread has posted the
message, it goes back to sleep, again waiting on the recv(...) function
for the next command. All the other master/slave conversations follow this
model. When requesting that an image be sent by the master to the slave, the
eight-byte data chunk tells the slave how much data to expect and the two
programs cooperate to send the whole image. The TCP protocol guarantees
that no data is lost in the transmission.
At first glance, the code for all these client-server conversations looks quite
complex, but that is only because of the care that needs to be taken to check
for errors and unexpected events such as the connections failing during transmission. In fact, we really only need to use four functions to send a message
from one program to another: socket(..) creates the socket, connect()
5
Transmission Control Protocol/Internet Protocol is a standard used for the communication between computers, in particular for transmitting data over networks for all Internet-connected machines.
6
Windows sockets (Winsock) is a specification that defines how Windows network software should access network services, particularly TCP/IP.
connects through the socket with a listening program, send(...) sends the
data and closesocket(..) terminates the process.
The receiving code is a little more complex because it usually runs in a
separate thread, but again it basically consists of creating a socket, binding the
listener to that socket with bind(..), accepting connections on that socket
with accept(..) and receiving the data with recv(..).
Along with the main programs, you will find an example of a small pair
of message sending and receiving programs. This illustrates in a minimal way
how we use the Windows sockets for communication over TCP/IP.
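For illustration, the sending side of one of these conversations can be boiled down to the sketch below; the port number 5000 and the function name are our own, and the error checking discussed above is reduced to a minimum. The program must be linked with ws2_32.lib.

#include <winsock2.h>

BOOL SendCommandToSlave(const char cmd[8])
{
    WSADATA wsa;
    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0) return FALSE;

    SOCKET s = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr = {};                               /* slave's listening address */
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(5000);
    addr.sin_addr.s_addr = inet_addr("192.168.1.2");

    BOOL ok = FALSE;
    if (connect(s, (sockaddr*)&addr, sizeof(addr)) == 0)
        ok = (send(s, cmd, 8, 0) == 8);                  /* the eight-byte chunk      */
    closesocket(s);
    WSACleanup();
    return ok;
}

The slave's receiving thread mirrors this with socket(..), bind(..), listen(..), accept(..) and recv(..).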
So far, we have concentrated on displaying image sequences using two
PCs and four video outputs. To extend this idea to similar applications that
present panoramic movies and 3D content poses some difficulties:
Movies must present their images at rates of 30 fps. Even the smallest
delay between the presentation of the central and peripheral parts of
the frame would be quite intolerable. Wide-screen images require a
high pixel count, too; for example, a stereoscopic movie is likely to
require a frame size of 3200 × 1200 pixels, at the minimum. The
movie player project we include with the book restricts the display to a
two-output stereoscopic wide-screen resolution of 1600 × 1200, so it
requires only a single host. To go all the way to a four output movie
player, it would be necessary to render two movies, one for the center
and one for the periphery. Each would need to be stored on its own PC,
and the master and slave control signals designed to keep the players in
synchronism.
In a four-output real-time 3D design, the communications channel
would need to pass the numerical descriptions of the 3D scene and
image maps from master to slave. Under most conditions, this data
distribution would only need to be done in an initialization stage. This
transfer might involve a considerable quantity of data, but during normal usage (with interaction or scripted action, for example), a much
smaller amount of data would need to be transferred, carrying such
details as viewpoint location, orientation, lighting conditions and object locations. Our example project code restricts itself to using a single
host and projecting onto a curved screen with two projectors.
In this section, we have highlighted some key features of a suite of core
programs to deliver a multi-projector cave-type immersive virtual environment. Larger commercial systems that can cost in excess of $100k will typically implement these functions in custom hardware. However, the common PC's processing and GPU capability now allow anyone with the right software to experiment with VR caves at a fraction of that cost. More details of
how this project works are included in the commentary on the CD.
18.6
In Section 8.4, the OpenCV and Vision SDK software libraries were introduced. These two libraries can be easily integrated into programs that need
special purpose functionality such as overlays and removing perspective distortion. Since the subjects of image processing and computer vision could
each fill a book, we are only going to provide a framework for using OpenCV,
typically for processing moving images.
Our examples come in the form of a special inline DirectShow filter and
two short application programs that use the filter. The programs comprise:
A filter. As part of the OpenCV distribution [6], a DirectShow filter called ProxyTrans provides a method to apply any of the OpenCV
image processing functions. Unfortunately, it didn't work for us, and
so we provide our own simpler code based on a minor modification
of the EZRGB24 sample that ships as part of DirectX. By adding one
additional interface method to the EZRGB24 project, our filter allows
any program which inserts it into a FilterGraph datastream to define
a callback function. The callback function is passed a pointer to an OpenCV image structure representing the image samples. Once we have access to every image sample passing through the FilterGraph, it is possible to apply any OpenCV process to it (a minimal callback sketch follows this list). Please consult the readme
file for details of how the component is registered with the operating
system. A registered DirectShow filter may be used by any program
(and in the graphedit utility).
AVI processor. This example shows how to use our filter in a scenario
where we wish to process a movie stored in an AVI file. It previews the
action of the image process and then writes it into another AVI file.
With the exception of illustrating how to instantiate our filter component and use the callback function, all the elements of the program
have been discussed in earlier chapters. The same applies to our next
program.
Camera-processor. This example illustrates how to apply OpenCV image processing functions to a real-time video source.
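As promised above, here is a minimal sketch of the kind of callback our filter could invoke for every sample passing through the FilterGraph. The function name and the way it is registered are illustrative; the body simply applies an OpenCV 1.x edge detector to the frame in place.

#include <cv.h>

void ProcessFrameCallback(IplImage *pImage)
{
    /* Convert to grey scale, find edges, and copy the result back into the frame. */
    IplImage *grey  = cvCreateImage(cvGetSize(pImage), IPL_DEPTH_8U, 1);
    IplImage *edges = cvCreateImage(cvGetSize(pImage), IPL_DEPTH_8U, 1);
    cvCvtColor(pImage, grey, CV_BGR2GRAY);
    cvCanny(grey, edges, 50, 150, 3);
    cvCvtColor(edges, pImage, CV_GRAY2BGR);
    cvReleaseImage(&grey);
    cvReleaseImage(&edges);
}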
18.7
Figure 18.8. The combination of a basic webcam stuck to a simple pair of video glasses forms a low-cost form of head-mounted display suitable for the augmented reality project.
Figure 18.9. Placing miniature mesh figures onto a real desktop illustrates the capability of an AR head-mounted display.
18.7.1 The ARToolKit
At the core of the ARToolKit is the action of placing the 3D object into the
field of view of a camera so that it looks as if the object were attached to a
marker in the real world. This is best illustrated with an example. First we
see the scene with the square marker indicated on the white sheet of paper in
Figure 18.10(a).
The software in the ARToolKit produces a transformation matrix that,
when applied to an OpenGL scene, makes it look as if the object were centered on the marker and lying in the plane of the marker. (The marker is
actually a black outline with the word 'Hiro' inside; this detail is not clear
in Figures 18.9 and 18.10). The nonsymmetric text in the marker allows a
definite frame of reference to be determined. When we render an OpenGL
mesh, it will appear to sit on top of the marker (see Figure 18.10(b)). If we
move the piece of paper with the marker around (and/or move the camera)
then the model will appear to follow the marker, as in Figure 18.10(c).
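The per-frame steps just described follow the pattern of the ARToolKit's standard examples; a compressed sketch (with patt_id, patt_width and patt_centre assumed to have been set up when the marker pattern file was loaded) looks like this:

#include <AR/ar.h>
#include <AR/video.h>
#include <AR/gsub.h>

void RenderFrame(int patt_id, double patt_width, double patt_centre[2])
{
    ARUint8 *image = arVideoGetImage();            /* latest camera frame            */
    if (image == NULL) return;

    ARMarkerInfo *marker_info;
    int marker_num;
    arDetectMarker(image, 100, &marker_info, &marker_num);   /* threshold = 100      */

    for (int i = 0; i < marker_num; i++) {
        if (marker_info[i].id != patt_id) continue;
        double patt_trans[3][4], gl_para[16];
        arGetTransMat(&marker_info[i], patt_centre, patt_width, patt_trans);
        argConvGlpara(patt_trans, gl_para);        /* 3x4 matrix -> OpenGL 4x4       */
        glMatrixMode(GL_MODELVIEW);
        glLoadMatrixd(gl_para);
        /* ... draw the OpenGL mesh here; it will appear to sit on the marker ...    */
    }
    arVideoCapNext();                              /* release the frame buffer       */
}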
Figure 18.10. In the ARToolKit, a marker in the camera's field of view (a) is used to
define a plane in the real world on which we can place a synthetic object so that it
looks like it is resting on the marker (b). Moving the marker causes the object to
follow it and so gives the illusion that the synthetic object is actually resting on the
marker.
glBegin(GL_LINES);
glVertex3f(0.0,0.0, 0.0);
glVertex3f(0.0,0.0,10.0);
glEnd();
Figure 18.11. Coordinate frames of reference used in the ARToolKit and the 3D
18.7.2
It is possible to build this pattern information for other custom markers; you should
consult the ARToolKit documentation for details on how to do this. It is also possible to
instruct the ARToolKit to look for several different markers in the same image; again, consult
the documentation for information on how to do this.
Figure 18.12. The steps performed by the ARToolKit as it tracks a marker in the
camera's field of view and determines the mapping homography between the plane in which the marker lies and the camera's projection plane.
18.8 Virtual Sculpting in 3D
To finish off this chapter, we thought we'd put together a small project for a desktop VR system that brings together some of the concepts we have been developing in the last few chapters. Actually, it is quite a large project. It is a desktop program that brings together input from two joysticks that will control a couple of virtual tools that we can use to push and pull a 3D polygonal mesh in order to sculpt it into different shapes. The display will be stereoscopically enabled and force feedback is simulated via the joysticks. Mathematical
models will be applied to the deformation of the mesh (as we push and pull
it) to make it look like it is made out of elastic material.
As has been the case with all the other projects proposed in this
chapter, we are not going to examine the code line by line; it's far too long
for that. But, after reading this section and the previous chapters, you should
be able to follow the logic of the flow of execution in the program using the
commentary and comments within the code and, most importantly, make
changes, for example to experiment with the mathematical model of
deformation.
Figure 18.13 illustrates how our project looks and behaves. In (a), a few
shapes made up from polygon meshes will be formed into a new shape by
pushing and pulling at them with the tools. In (b), after a bit of prodding,
the shapes have been roughly deformed. In (c), part of the mesh is visible
behind some of the shapes. Bounding boxes surround the shapes in the scene.
By testing intersections between bounding boxes, the speed of the collision
detection functions is increased considerably. These simple shapes are loaded
from 3D mesh files, so it is possible to use almost any shape of mesh model.
Now, we will briefly highlight some key points in the design of the
program.
18.8.1 The Idea
One of the key points about an application like this is that it must operate
in real time, so there are always going to be compromises that will have to be
made in designing it. The program has to run continuously, and it should
repeat the following tasks as quickly as possible:
1. Read the position and orientation of the joystick devices that are controlling the sculpting implements/tools.
2. Check for collision between the tools and the elastic models.
3. Since the tools are regarded as solid and the models as deformable, if
they collide (and where they overlap), move the vertices in the model
out of the way. Use a mathematical algorithm to determine how other
parts of the mesh model are deformed so as to simulate the elastic behavior.
4. Send a signal back to the joysticks that represent the tools. This signal
will generate a force for the user to feel. On simple joysticks, all that
we can really expect to do is simulate a bit of a kick.
5. Render a 3D view of the scene, i.e., the elastic models and sculpting
tools.
The project's program code must do its best to follow the logic of these steps. There are a number of functions to render the scene. These are called in response to a WM_PAINT message. They act almost independently from the rest of the code. When the program needs to indicate a change in the scene, typically as a result of moving the tools or the math model of elasticity moving some of the vertices in the mesh, a WM_PAINT message is dispatched.
The main action of the program is driven by a WM_TIMER message that
calls function RefreshScene() in which all these steps take place. Timing is
critical, and so the rate set for this timer is a very important design parameter.
If it is too slow, we will observe an unpleasant lag between our action in the
real world and the implements in the virtual sculptor. On the other hand, if
it is too fast, there will be insufficient time for the collision detection routines
to do their work or the mathematical elasticity simulation to come up with
an answer.
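In outline, the timer that drives these steps is ordinary Win32 code; the interval and identifier below are illustrative only, for the reasons discussed next.

#include <windows.h>

#define ID_REFRESH_TIMER   1
#define TIMER_INTERVAL_MS  25       /* an illustrative starting point; tune this */

void StartSimulationTimer(HWND hWnd)
{
    SetTimer(hWnd, ID_REFRESH_TIMER, TIMER_INTERVAL_MS, NULL);
}

/* Inside the window procedure:                                                  */
/*   case WM_TIMER:                                                              */
/*       if (wParam == ID_REFRESH_TIMER)                                         */
/*           RefreshScene();    // steps 1-4 above; it posts WM_PAINT for step 5 */
/*       break;                                                                  */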
Choosing the timer interval is one of a number of critically important
decisions that have to be made in designing a program like this. The others
relate to which values to use for the parameters in the mathematical model
of elastic behavior. There is a good theoretical basis for the choice of many
parameter values in the mathematical model, but often in practice we have
to compromise because of the speed and power of the computer's processor.
In our code, the timer setting and mathematical model parameters we quote
have evolved after experimentation, so you should feel free to see what effect
changing them has on your own system. In order to do this, we will give
a brief review of the theory underlying the core algorithm simulating elastic
deformation.
18.8.2
turn the mesh to its original shape when we release the vertex at point p′. In a simple one-dimensional spring (Figure 18.14(c)), the restoring force is proportional to the extension in the length of the spring. In Figure 18.14(d), when two or more springs are connected, all the nonfixed points such as p1 may move. All new positions must be obtained simultaneously; in this 1D case an analytic solution of the math model may be obtained. In Figure 18.14(e), if the springs form a complex 3D network then the effect of a distortion must be obtained using a numerical solution technique. In Figure 18.14(f), when the point p is moved to p′, the forces in all attached springs will change and the shape of the network will adjust to a new state of equilibrium.
This model fits very well with our polygonal representation of shapes.
The edges in our mesh model correspond to the little springs connected into
a 3D network by joining the vertices together. We think of the original configuration of the model as being its rest state, and as we pull one or more of
the vertices, the whole network of springs adjusts to take on a new shape, just
as they would if we really built the model out of real springs and pulled or
pushed on one of them.
The behavior of this network of springs is modeled mathematically by
using the well-known laws of Hooke and Newton. Hooke tells us that the
restoring force F in a spring is proportional to the extension Δp of the spring. That is, F = kΔp, where k is the constant of proportionality. Newton tells us that the acceleration a of a particle (in our case a vertex in our mesh) with a mass m is proportional to the force applied to it. That is, F = ma. We also know that acceleration is the second differential of the distance moved, so this equation can be rewritten as F = m p̈. This differential equation would
be relatively easy to solve for a single vertex and one spring, but in a mesh
with hundreds of vertices interconnected by springs in three dimensions, we
have to use a computer to solve it and also use a numerical approximation to
the modeling equations.
Several numerical techniques can be used to solve the equations that simulate how the vertices respond to an applied disturbance (House and Breen [4]
discuss this in depth). One technique that is fast and simple and has been used
by physicists for solving dynamic particle problems has also been shown by
computer game developers to be particularly good for our type of problem. It
is called Verlet integration [9], and whilst it allows us to avoid the complexity
of carrying out an implicit solution for the equations, we can also still avoid
some of the instabilities that occur in the simpler explicit solution techniques.
If you want to know all about the numerical solution of such problems and
all the challenges this poses, a good place to start is with [3].
Combining the laws of Hooke and Newton, we obtain a second-order
differential equation in 3D:
a_i = p̈_i = F/m = (k/m) Σ_{j=1}^{n} (p_j − p_i),        (18.1)
which we must solve at every vertex in the mesh. The term Σ_{j=1}^{n} (p_j − p_i) sums the change in the length of each of the n edges that are connected to vertex i. We must discretize Equation (18.1) by replacing the second-order differential p̈_i with an approximation. The key feature of Verlet integration recognizes that the first-order differential of p, that is ṗ_i, is a velocity, and that if we are given initial values for position and velocity, we can write an iterative pair of equations which may be used to determine how a vertex moves in a short time interval Δt. If p is the current position of a vertex and v is its current velocity then the updated position p′ and updated velocity v′ are given by

v′ = v + a Δt,
p′ = p + v′ Δt.

Normally, the velocity is not calculated explicitly and is approximated by a finite difference expression for the differential. If the previous position is given by p″ then v = (p − p″)/Δt. Thus we can rewrite our expression to compute p′ as

p′ = 2p − p″ + a Δt².

And of course, we need to update our previous position to our current position; that is, p″ = p. By substituting for a as given in Equation (18.1),
we have a remarkably simple iterative algorithm that we can use to solve for
the transient behavior of a mesh. With the right choice of parameters, it
offers a visually reasonable approximation to the behavior of a squashy/elastic/
deformable object.
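As a sketch of how Equation (18.1) and the Verlet update translate into code (this is not the book's Listing 18.1; the array sizes, parameter values and the omission of each spring's rest length are simplifications of our own):

#define MAX_VERTS 4096

void VerletStep(double (*p)[3], double (*pPrev)[3], double (*pNew)[3], int nVerts,
                const int (*edges)[2], int nEdges, double k, double m, double dt)
{
    /* Accumulate the accelerations a_i = (k/m) * sum over edges of (p_j - p_i).    */
    /* A fuller model would subtract each spring's rest length from the extension.  */
    static double a[MAX_VERTS][3];
    for (int i = 0; i < nVerts; i++)
        a[i][0] = a[i][1] = a[i][2] = 0.0;
    for (int e = 0; e < nEdges; e++) {
        int i = edges[e][0], j = edges[e][1];
        for (int c = 0; c < 3; c++) {
            double d = p[j][c] - p[i][c];
            a[i][c] += (k / m) * d;
            a[j][c] -= (k / m) * d;
        }
    }
    /* p' = 2p - p'' + a dt^2, then the previous position becomes the current one.  */
    for (int i = 0; i < nVerts; i++)
        for (int c = 0; c < 3; c++) {
            pNew[i][c]  = 2.0 * p[i][c] - pPrev[i][c] + a[i][c] * dt * dt;
            pPrev[i][c] = p[i][c];
            p[i][c]     = pNew[i][c];
        }
}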
We can improve slightly on the spring-mass model of Equation (18.1) by
adding in the effect of a damper in parallel with the spring. This helps to make
the object appear less elastic. It is easy to model this effect mathematically
Listing 18.1. The sculpt program's most vital actions are dispatched from function
RefreshScene(), which has to do the data acquisition, all the collision detection
and simulation of elastic deformation within the available time interval (about 30 to
50 times per second).
18.8.3 Implementation
The polygonal mesh models we will use for the elastic surfaces and the sculpting tools are loaded as part of the program initialization. Rectangular bounding boxes are built around them. These bounding boxes are used so that
collision detection can be done as quickly as possible. That is, a rectangular
box around a sculpting tool is checked for intersection with a rectangular box
around each of the elastic objects. If the boxes intersect then every polygon
in a tool is checked for intersection against every polygon in the object.
As mentioned before, the key steps in the program are initiated from function RefreshScene(), which is shown in outline in Listing 18.1. The input
for the two instruments is obtained from two joysticks by using DirectInput.
The mesh models are rendered in real time using OpenGL, which allows for
stereoscopic display.
It only remains to sum up by saying that this has been quite a complex
project in terms of what it set out to achieve. But apart from the mathematical
model for deformation, all of the other elements arise just by the application
of common sense, and most of the coding details, e.g., OpenGL/DirectInput,
have been described before.
18.9 Summary
Well, this brings us to the end of our projects chapter and indeed to the end
of the book itself. There are many more project ideas we might have followed
up, but as we hope you've seen from these few examples, most of the elements
that such projects would be built on are not that different from those we
explored in detail in Part II. We hope that our book has shown you some
new things, excited your imagination to dream up new ways of using VR and
provided some clues and pieces of computer code to help you do just that.
Have fun!
Bibliography
[1] The Augmented Reality Toolkit. http://www.hitl.washington.edu/artoolkit/.
[2] M. Donahoo and K. Calvert. TCP/IP Sockets in C: Practical Guide for Programmers. San Francisco, CA: Morgan Kaufmann, 2000.
[3] C. Gerald. Applied Numerical Analysis. Reading, MA: Addison Wesley, 1980.
[4] D. House and D. Breen (Editors). Cloth Modeling and Animation. Natick, MA: A. K. Peters, 2000.
[5] C. Hunt. TCP/IP Network Administration, Second Edition. Sebastopol, CA: O'Reilly, 1998.
[6] Intel Corporation. Open Source Computer Vision Library. http://www.intel.com/technology/computing/opencv/, 2006.
[7] Microsoft Corporation. DirectShow SDK. http://msdn.microsoft.com/directx/sdk/, 2006.
[8] M. Pesce. Programming Microsoft DirectShow for Digital Video and Television. Redmond, WA: Microsoft Press, 2003.
[9] L. Verlet. Computer Experiments on Classical Fluids. Physical Review 159 (1967) 98–103.
Rotation with Quaternions
A.1 The Quaternion
A.2
We have seen in Section 6.6.3 that the action of rotating a vector r from
one orientation to another may be expressed in terms of the application of
a transformation matrix R which transforms r to r′ = Rr, which has a new
orientation. The matrix R is independent of r and will perform the same
(relative) rotation on any other vector.
One can write R in terms of the Euler angles, i.e., as a function R(φ, θ, ψ).
However, the same rotation can also be achieved by specifying R in terms of a
unit vector n̂ and a single angle α. That is, r is transformed into r′ by rotating
it round n̂ through α. The angle α is positive when the rotation takes place in
a clockwise direction when viewed along n̂ from its base. The two equivalent
rotations may be written as

    r′ = R(φ, θ, ψ)r;
    r′ = R′(α, n̂)r.
At first sight, it might seem difficult to appreciate that the same transformation
can be achieved by specifying a single rotation round one axis as opposed
to three rotations round three orthogonal axes. It is also quite difficult to
imagine how (α, n̂) might be calculated, given the more naturally intuitive
and easier-to-specify Euler angles (φ, θ, ψ). However, there is a need for methods to switch from one representation to another.

In a number of important situations, it is necessary to use the
(α, n̂) representation. For example, the Virtual Reality Modeling
Language (VRML) [2] uses the (α, n̂) specification to define the
orientation adopted by an object in a virtual world.
Watt and Watt [4, p. 359] derive an expression that gives some insight
into the significance of the (α, n̂) specification of a rotational transform by
determining Rr in terms of (α, n̂):

    r′ = Rr = cos α r + (1 − cos α)(n̂ · r)n̂ + sin α (n̂ × r).    (A.1)
This expression is the vital link between rotational transformations and the
use of quaternions to represent them. To see this consider two quaternions:
1. p = (0, r), a quaternion formed by setting its scalar part to zero and its
vector part to r (the vector we wish to transform).
2. Express the matrix R(α, n̂) as the quaternion

       q = (cos(α/2), sin(α/2) n̂).    (A.2)

3. Evaluate the quaternion product p′ = q p q⁻¹.

4. Extract the transformed vector r′ from the vector component of p′ =
(0, r′). (Note that the scalar component of operations such as these will
always be zero.)
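These four steps translate directly into code. The sketch below is an illustration only (the Quat structure and function names are assumptions, not part of any listing in the project code); it rotates a vector by forming p′ = q p q⁻¹, using the conjugate as the inverse of the unit quaternion q.

    #include <math.h>

    typedef struct { double w, x, y, z; } Quat;

    /* Hamilton product of two quaternions */
    Quat QuatMultiply(Quat a, Quat b)
    {
        Quat r;
        r.w = a.w*b.w - a.x*b.x - a.y*b.y - a.z*b.z;
        r.x = a.w*b.x + a.x*b.w + a.y*b.z - a.z*b.y;
        r.y = a.w*b.y - a.x*b.z + a.y*b.w + a.z*b.x;
        r.z = a.w*b.z + a.x*b.y - a.y*b.x + a.z*b.w;
        return r;
    }

    /* Rotate the vector (rx,ry,rz) by angle alpha about the unit axis (nx,ny,nz). */
    void QuatRotateVector(double alpha, double nx, double ny, double nz,
                          double *rx, double *ry, double *rz)
    {
        Quat q    = { cos(alpha/2), sin(alpha/2)*nx, sin(alpha/2)*ny, sin(alpha/2)*nz };
        Quat qinv = { q.w, -q.x, -q.y, -q.z };   /* conjugate = inverse for unit q */
        Quat p    = { 0.0, *rx, *ry, *rz };      /* p = (0, r)                     */
        Quat pd   = QuatMultiply(QuatMultiply(q, p), qinv);
        *rx = pd.x;  *ry = pd.y;  *rz = pd.z;    /* scalar part pd.w is zero       */
    }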
This isn't the end of our story of quaternions because, just as rotational
transformations in matrix form may be combined into a single matrix, rotations in quaternion form may be combined into a single quaternion by
multiplying their individual quaternion representations together. Let's look
at a simple example to show this point. Consider two rotations R1 and R2
which are represented by quaternions q1 and q2 respectively.
By applying q1 to the point p, we will end up at p′. By further applying
q2 to this new point, we will be rotated to p″; that is,

    p′ = q1 p q1⁻¹;
    p″ = q2 p′ q2⁻¹;
    p″ = q2 (q1 p q1⁻¹) q2⁻¹;
    p″ = (q2 q1) p (q1⁻¹ q2⁻¹).
A.2.1

From the Euler angles (φ, θ, ψ), a quaternion that encapsulates the same information is constructed by writing quaternions for the rotation of φ about the
z-axis, θ about the y-axis and ψ about the x-axis:

    qx = [cos(ψ/2), sin(ψ/2), 0, 0],
    qy = [cos(θ/2), 0, sin(θ/2), 0],
    qz = [cos(φ/2), 0, 0, sin(φ/2)].

Multiplying these together, q = qz qy qx, gives the single equivalent quaternion
q = [w, x, y, z] with

    w = cos(ψ/2) cos(θ/2) cos(φ/2) + sin(ψ/2) sin(θ/2) sin(φ/2),
    x = sin(ψ/2) cos(θ/2) cos(φ/2) − cos(ψ/2) sin(θ/2) sin(φ/2),
    y = cos(ψ/2) sin(θ/2) cos(φ/2) + sin(ψ/2) cos(θ/2) sin(φ/2),
    z = cos(ψ/2) cos(θ/2) sin(φ/2) − sin(ψ/2) sin(θ/2) cos(φ/2).
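In code, the product above is usually written out directly. A minimal sketch (illustrative only; it assumes the ψ-about-x, θ-about-y, φ-about-z ordering used above):

    #include <math.h>

    typedef struct { double w, x, y, z; } Quat;

    /* Build a quaternion from the Euler angles psi (about x), theta (about y)
       and phi (about z), using the expanded product q = qz qy qx.             */
    Quat QuatFromEuler(double psi, double theta, double phi)
    {
        double cp = cos(psi/2),   sp = sin(psi/2);
        double ct = cos(theta/2), st = sin(theta/2);
        double cf = cos(phi/2),   sf = sin(phi/2);
        Quat q;
        q.w = cp*ct*cf + sp*st*sf;
        q.x = sp*ct*cf - cp*st*sf;
        q.y = cp*st*cf + sp*ct*sf;
        q.z = cp*ct*sf - sp*st*cf;
        return q;
    }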
A.2.2
The rotational transformation matrix equivalent to the quaternion q = [w, x, y, z]
is

    | 1 − 2y² − 2z²   2xy + 2wz       2xz − 2wy       0 |
    | 2xy − 2wz       1 − 2x² − 2z²   2yz + 2wx       0 |
    | 2xz + 2wy       2yz − 2wx       1 − 2x² − 2y²   0 |
    | 0               0               0               1 |
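A direct translation of this matrix into code might look as follows (a sketch only; the 4 × 4 array is stored row-major and the quaternion is assumed to be of unit length):

    /* Fill the 4 x 4 rotation matrix M equivalent to the unit quaternion [w,x,y,z]. */
    void QuatToMatrix(double w, double x, double y, double z, double M[4][4])
    {
        M[0][0] = 1 - 2*y*y - 2*z*z;  M[0][1] = 2*x*y + 2*w*z;
        M[0][2] = 2*x*z - 2*w*y;      M[0][3] = 0;
        M[1][0] = 2*x*y - 2*w*z;      M[1][1] = 1 - 2*x*x - 2*z*z;
        M[1][2] = 2*y*z + 2*w*x;      M[1][3] = 0;
        M[2][0] = 2*x*z + 2*w*y;      M[2][1] = 2*y*z - 2*w*x;
        M[2][2] = 1 - 2*x*x - 2*y*y;  M[2][3] = 0;
        M[3][0] = 0;  M[3][1] = 0;  M[3][2] = 0;  M[3][3] = 1;
    }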
A.3

To convert a rotational transformation matrix M with elements aij back into
the equivalent quaternion q = [w, x, y, z], use the algorithm given in Figure A.1.
A.4 Converting a Quaternion to Euler Angles
To convert the quaternion to the equivalent Euler angles, first convert the
quaternion to an equivalent matrix, and then use the matrix-to-Euler-angle
conversion algorithm. Sadly, matrix-to-Euler-angle conversion is unavoidably
ill-defined because the calculations involve inverse trigonometric functions.
To make the conversion, use the algorithm shown in Figure A.2, which converts the matrix M with elements aij to the Euler angles (φ, θ, ψ).

The angles (φ, θ, ψ) lie in the interval [−π, π], but they can be biased to
[0, 2π] or some other suitable range if required.
    w = ¼(1 + a00 + a11 + a22)
    if w > ε {
        w = √w
        w4 = 1/(4w)
        x = w4 (a12 − a21)
        y = w4 (a20 − a02)
        z = w4 (a01 − a10)
    }
    else {
        w = 0
        x = −½(a11 + a22)
        if x > ε {
            x = √x
            x2 = 1/(2x)
            y = x2 a01
            z = x2 a02
        }
        else {
            x = 0
            y = ½(1 − a22)
            if y > ε {
                y = √y
                z = a12 / (2y)
            }
            else {
                y = 0
                z = 1
            }
        }
    }

Figure A.1. An algorithm for the conversion of rotational transformation matrix M
with coefficients aij to quaternion q with coefficients (w, x, y, z). The parameter ε is
the machine precision of zero. A reasonable choice would be 10⁻⁶ for floating-point
calculations. Note: only elements of M that contribute to rotation are considered in
the algorithm.
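A C rendering of the algorithm in Figure A.1 might look like this (a sketch that mirrors the figure; the constant EPSILON plays the role of ε and the matrix is assumed to be stored as a[4][4]):

    #include <math.h>
    #define EPSILON 1.0e-6

    /* Convert the rotation part of the matrix a to the quaternion (w, x, y, z). */
    void MatrixToQuat(double a[4][4], double *w, double *x, double *y, double *z)
    {
        *w = 0.25 * (1.0 + a[0][0] + a[1][1] + a[2][2]);
        if (*w > EPSILON) {
            double w4;
            *w = sqrt(*w);
            w4 = 1.0 / (4.0 * (*w));
            *x = w4 * (a[1][2] - a[2][1]);
            *y = w4 * (a[2][0] - a[0][2]);
            *z = w4 * (a[0][1] - a[1][0]);
        }
        else {
            *w = 0.0;
            *x = -0.5 * (a[1][1] + a[2][2]);
            if (*x > EPSILON) {
                double x2;
                *x = sqrt(*x);
                x2 = 1.0 / (2.0 * (*x));
                *y = x2 * a[0][1];
                *z = x2 * a[0][2];
            }
            else {
                *x = 0.0;
                *y = 0.5 * (1.0 - a[2][2]);
                if (*y > EPSILON) {
                    *y = sqrt(*y);
                    *z = a[1][2] / (2.0 * (*y));
                }
                else { *y = 0.0; *z = 1.0; }
            }
        }
    }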
    sin θ = −a02
    cos θ = √(1 − sin²θ)
    if |cos θ| < ε {
        It is not possible to distinguish heading
        from pitch and the convention that
        φ is 0 is assumed, thus:
        sin ψ = −a21
        cos ψ = a11
        sin φ = 0
        cos φ = 1
    }
    else {
        sin ψ = a12 / cos θ
        cos ψ = a22 / cos θ
        sin φ = a01 / cos θ
        cos φ = a00 / cos θ
    }
    ψ = ATAN2(sin ψ, cos ψ)
    θ = ATAN2(sin θ, cos θ)
    φ = ATAN2(sin φ, cos φ)

Figure A.2. Conversion from a rotational transformation matrix to the equivalent Euler
angles of rotation.
A.5 Interpolating Quaternions
Intercontinental flight paths for aircraft follow great circles; for example, the flight path
between London and Tokyo passes close to the North Pole. We can therefore say that just as the shortest distance between two points in a Cartesian
frame of reference is along a straight line, the shortest distance between two
(latitude, longitude) coordinates is along a path following a great circle. The
(latitude, longitude) coordinates at intermediate points on the great circle are
determined by interpolation, in this case by spherical interpolation.
Quaternions are used for this interpolation. We think of the end points of
the path being specified by quaternions, q0 and q1 . From these, a quaternion
qi is interpolated for any point on the great circle joining q0 to q1 . Conversion
of qi back to (latitude, longitude) allows the path to be plotted. In terms
of angular interpolation, we may think of latitude and longitude as simply
two of the Euler angles. When extended to the full set (φ, θ, ψ), a smooth
interpolation along the equivalent of a great circle is the result.
Using the concept of moving along a great circle as a guide to angular
interpolation, a spherical interpolation function, slerp(), may be derived. The
form of this function that works for interpolating between quaternions q0 =
[w0 , x0 , y0 , z0 ] and q1 = [w1 , x1 , y1 , z1 ] is given in [3] as
    slerp(μ, q0, q1) = (sin((1 − μ)Ω) / sin Ω) q0 + (sin(μΩ) / sin Ω) q1,

where μ, the interpolation parameter, takes values in the range [0, 1]. The
angle Ω is obtained from cos Ω = q0 · q1 = w0w1 + x0x1 + y0y1 + z0z1.
Figure A.3. A great circle gives a path of shortest distance between two points on the
surface of a sphere. The arc between points X and Y is the shortest path.
    cos Ω = w0w1 + x0x1 + y0y1 + z0z1
    if |cos Ω| > 1 then normalize q0 and q1 by
        dividing the components of each by its magnitude
    Ω = cos⁻¹(cos Ω)
    if |sin Ω| < ε {
        β0 = 1 − μ
        β1 = μ
    }
    else {
        β0 = sin((1 − μ)Ω) / sin Ω
        β1 = sin(μΩ) / sin Ω
    }
    wi = β0 w0 + β1 w1
    xi = β0 x0 + β1 x1
    yi = β0 y0 + β1 y1
    zi = β0 z0 + β1 z1

Figure A.4. Algorithmic implementation of the slerp() function for the interpolated
quaternion qi = [wi, xi, yi, zi].
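In C, the algorithm of Figure A.4 might be written as follows (a sketch; the Quat structure is an assumption, and clamping the dot product replaces the normalization step of the figure):

    #include <math.h>

    typedef struct { double w, x, y, z; } Quat;

    /* Spherical linear interpolation between unit quaternions q0 and q1.
       mu = 0 returns q0, mu = 1 returns q1.                               */
    Quat QuatSlerp(Quat q0, Quat q1, double mu)
    {
        const double eps = 1.0e-6;
        double cosOmega = q0.w*q1.w + q0.x*q1.x + q0.y*q1.y + q0.z*q1.z;
        double omega, sinOmega, b0, b1;
        Quat qi;
        if (cosOmega >  1.0) cosOmega =  1.0;    /* guard against rounding */
        if (cosOmega < -1.0) cosOmega = -1.0;
        omega    = acos(cosOmega);
        sinOmega = sin(omega);
        if (fabs(sinOmega) < eps) {              /* quaternions nearly coincide */
            b0 = 1.0 - mu;
            b1 = mu;
        }
        else {
            b0 = sin((1.0 - mu) * omega) / sinOmega;
            b1 = sin(mu * omega) / sinOmega;
        }
        qi.w = b0*q0.w + b1*q1.w;
        qi.x = b0*q0.x + b1*q1.x;
        qi.y = b0*q0.y + b1*q1.y;
        qi.z = b0*q0.z + b1*q1.z;
        return qi;
    }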
Bibliography
[1] W. R. Hamilton. On Quaternions: Or on a New System of Imaginaries in Algebra. Philosophical Magazine 25 (1844) 10–14.
[2] J. Hartman, J. Wernecke and Silicon Graphics, Inc. The VRML 2.0 Handbook. Reading, MA: Addison-Wesley Professional, 1996.
[3] K. Shoemake. Animating Rotation with Quaternion Curves. Proc. SIGGRAPH '85, Computer Graphics 19:3 (1985) 245–254.
[4] A. Watt and M. Watt. Advanced Animation and Rendering Techniques: Theory
and Practice. Reading, MA: Addison-Wesley Professional, 1992.
The Generalized Inverse
In inverse kinematics (IK) we must solve a linear system of the form

    A Δθ = ΔX,    (B.1)

in which the n × m matrix A (the Jacobian) gives more unknowns than equations. Provided the inverse of AAᵀ exists, a solution is given by the generalized
(pseudo-) inverse of A:

    Δθ = Aᵀ(AAᵀ)⁻¹ ΔX.    (B.2)
This is exactly the expression that we need in order to invert the n × m Jacobian
matrix.
For more information on generalized inverses and their properties and
limitations, consult [1]. Specifically, the two points most important for IK are
the existence of an inverse for AAᵀ and the fact that normally we have more
unknowns than equations (i.e., m > n). The practical implication of m > n
is that the articulation can attain its goal in more than one configuration.
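As an illustration of how the generalized inverse is used in practice, the sketch below computes Δθ = Jᵀ(JJᵀ)⁻¹ΔX for the common IK case of a 3 × m Jacobian (three task-space equations, m joint angles). The fixed 3 × 3 inversion by cofactors and all the names are assumptions made for this example, not part of any listing elsewhere.

    /* Solve J * dTheta = dX in the least-norm sense, dTheta = J^T (J J^T)^-1 dX,
       for a 3 x m Jacobian J stored row-major as J[row*m + col].
       Returns 0 if J J^T is (near) singular, 1 on success.                      */
    int GeneralizedInverseStep(const double *J, int m, const double dX[3], double *dTheta)
    {
        double A[3][3] = {{0}}, y[3], det;
        double c00, c01, c02, c11, c12, c22;
        int r, c, k;
        /* A = J J^T (a symmetric 3 x 3 matrix) */
        for (r = 0; r < 3; r++)
            for (c = 0; c < 3; c++)
                for (k = 0; k < m; k++)
                    A[r][c] += J[r*m + k] * J[c*m + k];
        /* cofactors of the symmetric matrix A (its adjugate equals its cofactor matrix) */
        c00 =  A[1][1]*A[2][2] - A[1][2]*A[2][1];
        c01 = -(A[1][0]*A[2][2] - A[1][2]*A[2][0]);
        c02 =  A[1][0]*A[2][1] - A[1][1]*A[2][0];
        c11 =  A[0][0]*A[2][2] - A[0][2]*A[2][0];
        c12 = -(A[0][0]*A[2][1] - A[0][1]*A[2][0]);
        c22 =  A[0][0]*A[1][1] - A[0][1]*A[1][0];
        det = A[0][0]*c00 + A[0][1]*c01 + A[0][2]*c02;
        if (det < 1.0e-9 && det > -1.0e-9) return 0;   /* (J J^T) not invertible */
        /* y = (J J^T)^-1 dX */
        y[0] = (c00*dX[0] + c01*dX[1] + c02*dX[2]) / det;
        y[1] = (c01*dX[0] + c11*dX[1] + c12*dX[2]) / det;
        y[2] = (c02*dX[0] + c12*dX[1] + c22*dX[2]) / det;
        /* dTheta = J^T y */
        for (c = 0; c < m; c++)
            dTheta[c] = J[0*m + c]*y[0] + J[1*m + c]*y[1] + J[2*m + c]*y[2];
        return 1;
    }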
Bibliography
[1] T. Boullion and P. Odell. Generalized Inverse Matrices. New York: John Wiley
and Sons, 1971.
Aligning Two Images in a Panoramic Mosaic
In the following argument, we follow the work of Szeliski and Shum [2].
To register the image I1(x′) with Ic(x), where x′ = Hx, we first compute
an approximation to I1, i.e.,

    Ĩ1(x) = I1(Hx),

and then find a deformation of Ĩ1(x) which brings it into closer alignment
with Ic(x) by updating H.
The correction is expressed through a matrix of small deformation parameters

        | d0 d1 d2 |
    D = | d3 d4 d5 |,    (C.2)
        | d6 d7 d8 |

which updates the homography as H ← (I + D)H (Equation (C.1)), with

    x′ = ((1 + d0)x + d1 y + d2) / (d6 x + d7 y + (1 + d8)),    (C.3)

    y′ = (d3 x + (1 + d4)y + d5) / (d6 x + d7 y + (1 + d8)).    (C.4)

The parameters in D are found by minimizing the difference between Ĩ1 and Ic,
which is done by expanding Ĩ1 about the current estimate.
To first order, the resampled image may be expanded as a Taylor series in the
deformation parameters:

    Ĩ1(x′i) ≈ Ĩ1(xi) + giᵀ Jiᵀ d,    (C.6)

where gi is the image gradient at the sample point xi,

    gi = ∇Ĩ1(xi) = ( ∂Ĩ1/∂x , ∂Ĩ1/∂y )ᵀ,    (C.7)

and Jiᵀ is the Jacobian of the corrected coordinates (x′, y′) with respect to the
nine deformation parameters d = (d0, ..., d8),

    Jiᵀ = ∂x′/∂d = | ∂x′/∂d0 ... ∂x′/∂d8 |
                   | ∂y′/∂d0 ... ∂y′/∂d8 |,    (C.8)

which, evaluated at xi with d = 0, is

    Jiᵀ = | x  y  1  0  0  0  −x²  −xy  −x |
          | 0  0  0  x  y  1  −xy  −y²  −y |.    (C.9)
The registration error to be minimized is therefore

    E(d) = Σi [ giᵀ Jiᵀ d + (Ĩ1(xi) − Ic(xi)) ]²,    (C.10)

which, writing Ai = giᵀ Jiᵀ (a row vector with elements ai0, ..., ai8) and
ei = Ĩ1(xi) − Ic(xi), is minimized by setting

    ∂ Σi (Ai d + ei)² / ∂dj = 0,  j = 0, ..., 8.

Writing explicitly the case where j = 0,

    Σi ∂(ai0 d0 + ai1 d1 + ··· + ai8 d8 + ei)² / ∂d0 = 0,

or, by rearranging,

    Σi ( ai0 ai0 d0 + ai0 ai1 d1 + ··· + ai0 ai8 d8 ) = − Σi ei ai0,
with similar equations for j = 1, ..., 8, ending with

    Σi ( ai8 ai0 d0 + ai8 ai1 d1 + ··· + ai8 ai8 d8 ) = − Σi ei ai8,

and if we resubstitute giᵀ Jiᵀ for Ai and Ji gi for Aiᵀ, we arrive at the normal
equations

    ( Σi Ji gi giᵀ Jiᵀ ) d = − ( Σi ei Ji gi ),    (C.11)

i.e., A d = b with

    A = Σi Ji gi giᵀ Jiᵀ   and   b = − Σi ei Ji gi,
which upon solution yields the nine coefficients of d. Hence, the updated
homography in Equation (C.1) may be determined. With a new H known,
the iterative algorithm may proceed.
The solution of Equation (C.11) may be obtained by a method such as
Cholesky decomposition [1] and the whole algorithm works well, provided
the alignment between the two images is only off by a few pixels. Following
the argument in [2], it is a good idea to use a modified form of the minimization procedure requiring only three parameters. This corresponds to the case
in which each image is acquired by making a pure rotation of the camera, i.e.,
there is no translational movement or change of focal length.
If a point p (specified in world coordinates (px , py , pz )) in the scene projects
to image location x (i.e., homogeneous coordinates (x, y, 1)) in image Ic then
after the camera and the world coordinate frame of reference have been rotated slightly, it will appear in image Ii to lie at p′, where

    p′ = Rp.    (C.12)

The projection into the image is governed by the simplified camera matrix

        | f  0  0 |
    F = | 0  f  0 |.
        | 0  0  1 |

If we know x then we can write (up to an arbitrary scale factor) p = F⁻¹x.
Using Equation (C.12), one obtains the homography

    x′ = FRF⁻¹x.    (C.13)

Writing the rotation R in the angle-axis form of Equation (A.1), with unit axis n̂
and angle θ, gives

    p′ = cos θ p + (1 − cos θ)(n̂ · p)n̂ + sin θ (n̂ × p).    (C.14)
Since we are matching images that are only mildly out of alignment (by a
few pixels), the rotation of the camera which took the images must have been
small (say by an angle θ). Using the small-angle approximation (sin θ ≈ θ,
cos θ ≈ 1) allows Equation (C.14) to be simplified as

    p′ = p + θ(n̂ × p),    (C.15)

which may be written in matrix form as

    p′ = (I + W)p,    (C.16)
where, with w = θn̂,

        |  0   −wz   wy |      |  0   −nz   ny |
    W = |  wz   0   −wx |  = θ |  nz   0   −nx |.
        | −wy   wx   0  |      | −ny   nx   0  |
Comparing Equations (C.16) and (C.12), one can see that for small rotations
R ≈ I + W, and so Equation (C.13) becomes

    x′ = F(I + W)F⁻¹x = (I + FWF⁻¹)x.    (C.17)
Notice that Equation (C.17) is exactly the same as Equation (C.2) if we let
D = FWF⁻¹, and so the same argument and algorithm of Equations (C.2)
through (C.11) may be applied to determine the matrix of parameters W.
However, in the purely rotational case, W contains only three unknowns wx,
wy and wz. This makes the solution of the linear Equations (C.11) much
more stable. A couple of minor modifications to the argument are required in
the case of the homography H ← (I + D)H with D given by D = FWF⁻¹.
The vector d becomes d = (0, −wz, f wy, wz, 0, −f wx, −wy/f, wx/f, 0). The
Taylor expansion in Equation (C.6) must account for the fact that d is really
only a function of the three unknowns, d ≡ d(w) with w = (wx, wy, wz). Thus it
becomes

    Ĩ1(x′(w + Δw)) = Ĩ1(x′(w)) + ∇Ĩ1(x′)ᵀ (∂x′/∂d)(∂d/∂w) Δw + ···,    (C.18)
where ∂d/∂w is the 9 × 3 matrix

    ∂d/∂w = | ∂d0/∂wx  ∂d0/∂wy  ∂d0/∂wz |       |  0    0    0  |
            | ∂d1/∂wx  ∂d1/∂wy  ∂d1/∂wz |       |  0    0   −1  |
            |   ...      ...      ...   |       |  0    f    0  |
            | ∂d8/∂wx  ∂d8/∂wy  ∂d8/∂wz |       |  0    0    1  |
                                          ; i.e., ∂d/∂w = |  0    0    0  |
                                                  | −f    0    0  |
                                                  |  0  −1/f   0  |
                                                  | 1/f   0    0  |
                                                  |  0    0    0  |.

Combining this with Equation (C.9), the Jacobian with respect to the three
rotational parameters, evaluated at xi, becomes the 2 × 3 matrix

    Jiᵀ = (∂x′/∂d)(∂d/∂w) |xi = | −xy/f         f + x²/f    −y |
                                | −(f + y²/f)   xy/f         x |.
If the camera's focal length is not known, it may be estimated from one or
more of the transformations H obtained using the eight-parameter algorithm.
When H corresponds to a rotation only, we have seen that

        | h0  h1  h2 |             | r00     r01     r02 f |
    H = | h3  h4  h5 |  ~ FRF⁻¹ =  | r10     r11     r12 f |.
        | h6  h7  1  |             | r20/f   r21/f   r22   |

The symbol ~ indicates equality up to scale, and so, for example, r02 = h2/f.
We don't show it here, but for H to satisfy the rotation condition, the norm
of either the first two rows or the first two columns of R must be the same.
And at the same time, the first two rows or columns must also be orthogonal.
Using this requirement allows us to determine the camera focal length f with

    f² = −h2 h5 / (h0 h3 + h1 h4)                    if h0 h3 ≠ −h1 h4,

    f² = (h5² − h2²) / (h0² + h1² − h3² − h4²)       if h0² + h1² ≠ h3² + h4².
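In code, the two estimates are easily evaluated. A sketch (the array h holds the coefficients h0 ... h7 with h8 normalized to 1; a non-positive value under the square root indicates that H was not consistent with a pure rotation):

    #include <math.h>

    /* Estimate the focal length from a homography h[9] (h[8] == 1) assumed to
       correspond to a pure rotation. Returns -1.0 if no estimate is possible.  */
    double FocalFromHomography(const double h[9])
    {
        double d1 = h[0]*h[3] + h[1]*h[4];
        double d2 = h[0]*h[0] + h[1]*h[1] - h[3]*h[3] - h[4]*h[4];
        double f2;
        if (fabs(d1) > 1.0e-8)            /* first estimate: row orthogonality  */
            f2 = -h[2]*h[5] / d1;
        else if (fabs(d2) > 1.0e-8)       /* second estimate: equal row norms   */
            f2 = (h[5]*h[5] - h[2]*h[2]) / d2;
        else
            return -1.0;                  /* degenerate: cannot estimate f      */
        return (f2 > 0.0) ? sqrt(f2) : -1.0;
    }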
Bibliography
[1] W. Press et al. (editors). Numerical Recipes in C++: The Art of Scientific Computing, Second Edition. Cambridge, UK: Cambridge University Press, 2002.
[2] H. Shum and R. Szeliski. Panoramic Image Mosaics. Technical Report MSR-TR-97-23, Microsoft Research, 1997.
A Minimal Windows Template Program
A Windows program is made up of two main elements: the source code and
the resources that the compiled code uses as it executes. Resources are things
such as icons, little graphics to represent the mouse pointer, bitmaps to appear
in toolbars and a description of the dialog boxes. This information is gathered
together in a .RC file which is a simple text file that is normally only viewed by
using an integrated development environment such as Microsoft's Developer
Studio. The resources are compiled in their own right (into a .RES file) and
then this and the object modules that result from the compilation of the
C/C++ source codes are combined with any libraries to form the executable
image of the application program. A Windows application will not only need
to be linked with the standard C libraries but also a number of libraries that
provide the graphical user interface (GUI) and system components.
A large proportion of the Windows operating system is provided
in dynamically linked libraries (DLLs) that are loaded and used
only when the program is executing. SHELL32.DLL, GDI32.DLL
and USER32.DLL are examples of these components. When linking a program that needs to call a system function in one of
Listing D.1. The text of a Windows resource file before compilation. It shows how
resources are identified, a small menu is specified and a dialog box described.
Figure D.1. A typical collection of the files required to build a minimal Windows pro-
gram. Icons, mouse cursors and small bitmaps are all used to beautify the appearance
of the application program to the user.
DLLs are both a very clever idea and one of the biggest curses to
the Windows programmer. See Section 12.4 for more on DLLs.
It can be quite interesting to sneak a look at a few lines from a resource
file (see Listing D.1) and also to see a list of the files that need to be brought
together in the creation of even the smallest of Windows applications (see
Figure D.1).
All the work for the basic template is done by the code in the C source
file main.c and its co-named header file. The files StdAfx.h and StdAfx.cpp
are used to help speed up the build process by precompiling the rather large
system header files and can usually be left alone. In main.c, there are two
principal functions: the program's entry point and a function to process and
act on messages sent to the application by Windows. Handling messages is
the key concept in Windows programming. Messages are sent to the program's
handler function whenever the user selects a command from the menu bar or
moves the mouse across the window. Setting up the shape, size and behavior
of the application's user interface and interacting with Windows is done by
calling functions in the application programmer interface (API). All these
functions exist within one of the system DLLs (mentioned earlier). Function
prototypes for the API functions, useful constants and the stub libraries are
all provided and documented in the SDK.
For our minimal application, the template must exhibit the key features
of message loop, definition of window class and message handling function. It
will have an entry point in the conventional C sense, the WinMain() function.
LRESULT CALLBACK WndProc(HWND hWnd, UINT message, WPARAM wParam, LPARAM lParam){
  // local variables to identify the menu command
  int wmId, wmEvent;
  // required to handle the WM_PAINT message
  PAINTSTRUCT ps;
  // allows us to draw into the client area of window
  HDC hdc;
  // the client area rectangle (needed by DrawText)
  RECT rt;
  // act on the message
  switch (message)
  {
  case WM_COMMAND:
    // the low 16 bits of wParam give the menu command
    wmId    = LOWORD(wParam);
    wmEvent = HIWORD(wParam);
    switch (wmId){
    // process the menu command selected by the user
    case IDM_EXIT:
      // tell windows to send us back a WM_DESTROY message
      DestroyWindow(hWnd);
      break;
    default:
      return DefWindowProc(hWnd, message, wParam, lParam);
    }
    break;
  case WM_PAINT:
    // tells Windows "WE" are going to draw in client area
    hdc = BeginPaint(hWnd, &ps);
    // which gives us a "handle" to allow us to do so!
    GetClientRect(hWnd, &rt);
    // Just for fun!
    DrawText(hdc, "Hello", strlen("Hello"), &rt, DT_CENTER);
    // paired with BeginPaint
    EndPaint(hWnd, &ps);
    break;
  case WM_DESTROY:
    // send a message to break out of loop in WinMain
    PostQuitMessage(0);
    break;
  default:
    // default message handling
    return DefWindowProc(hWnd, message, wParam, lParam);
  }
  return 0;
}
The message handling function WndProc() receives four parameters: a handle (which
is a bit like a pointer) to identify the window, an integer to identify the message and two 32-bit data words holding information for the message. The
message data could be a command identifier, or in the case of a user-defined
message, a pointer to a major data structure in memory.
Our minimal template acts upon three of the most commonly handled
messages:

1. WM_COMMAND. This message typically arises from a program's menu or
a toolbar. The low-order 16 bits of the WPARAM data word hold the
integer identifier of the menu command selected. For example, in the
resource script file in Listing D.1, the EXIT command is given an
identifier of IDM_EXIT. This is defined as the integer 1001 in the header
file resource.h. Since resource.h is included in both resource and
source files, the same symbolic representation for the exit command
may be used in both places.

2. WM_PAINT. This is probably the most significant message from the point
of view of any graphical application program. Anything that is to appear inside the window (in its so-called client area) must be drawn in
response to this message. So, the OpenGL, the DirectShow and the
Direct3D examples will all do their rendering in response to this message. To be able to draw in the client area,¹ Windows gives the program
a handle (pointer) to what is called a device context (HDC hDC;) which
is needed by most drawing routines, including the 3D ones. The simple example in Listing D.3 draws the word Hello in response to the
WM_PAINT message.

3. WM_DESTROY. This is the message sent to the window when the close
button is clicked. If this is a program's main window, it usually is a signal that the program should terminate. PostQuitMessage(0) sends
the NULL message, which will break out of the message-processing
loop (Listing D.2) when it is processed for dispatch.
¹The client area is the part of the window within the border, below any menu and title bar.
Applications draw their output in the client area, in response to a WM_PAINT message.
An MFC Template Program
In programs that use the Microsoft Foundation Class (MFC) library, the elements of message loop, window class and message processing function described in Appendix D are hidden inside the library classes. The application
programmer is thus mainly concerned with only having to write code to respond to the messages of interest to them. This is done in short handler
functions without the need for the big switch/case statement. This programming model is somewhat similar to that used in Visual Basic. For example,
one might want to process mouse button clicks and/or respond to a menu
command. These functions are formed by deriving application-specific C++
classes from the base MFC classes.
The MFC is based on the object-orientated programming model
exemplified by the C++ language. Each program is represented
by an object, called an application object. This has a document
object associated with it and one or more view objects to manage
the image of the document shown to the user of the program. In
turn, these objects are instances of a class derived from an MFC
base class.
The relationship amongst the key objects is illustrated in Figure E.1. Our
framework code will show how these classes are used in practice.
It is up to the programmer to decide what information is recorded about
the document and how that is processed. The MFC base classes help process
Figure E.1. The key elements in an application program using the MFC.
the document, for example reading or writing it to/from disk files. It is also
the responsibility of the application program to draw in the view object's
window(s) those aspects of the document it wishes to display. In the context
of the MFC, a document can be as diverse as a text file, a JPEG picture or a
3D polygonal CAD model.
The program's application, document and view objects are instances of
classes derived from the MFC base classes CWinApp, CDocument and CView
respectively. There are over 100 classes in the MFC. Figure E.2 illustrates the
hierarchical relationship amongst the ones related to drawing, documents and
views.
Perhaps the main advantage of the MFC is that it provides a number of
detailed user interface objects, for example, dockable toolbars, tooltips and
status information, with virtually no extra work on the part of the programmer. However, the documentation for the MFC is extensive and it takes quite
a while to become familiar with the facilities offered by it. The complexity
of the MFC has reached such a stage that it is more or less essential to use
an integrated development environment such as that provided by Visual C++
to help you get a simple application up and running. The so-called Application and Class wizards, which are nothing more than sophisticated codegenerating tools, make the construction of a Windows application relatively
painless. For more information on the use of the MFC and Visual C++ programming in general, consult one of the texts [1] or [2]. When you use the
Class wizard to create a template framework that will be suitable for programs
involving OpenGL, Direct3D or DirectShow, the developer tools will put the
Figure E.2. The MFC class hierarchy as it relates to documents and views. Many
classes have been omitted.
important classes into separate files, each with their own header to define the
derived class. The names of the files are derived from the application name
(in this example, it is FW).
Figure E.3, depicting a list of the files and classes in the template program,
contains a wealth of detail. As you can see, the Class View window contains a
lot of information. It lists four important classes:
1. Class CFwApp. This is the main application class. It is derived from the
MFC class CWinApp and the whole application is based on the static
creation of an object of this class. Here is the line of code from fw.cpp
that handles everything:
////////////////// The one and only CFwApp object!
CFwApp theApp; // create static instance of application object
Figure E.3. The file view (a) and the class view (b) of a basic MFC template application.
3. Class CFwView. This class represents the client area of the program's
window. Of prime importance in this class is the function¹ that mimics
the WM_PAINT message in a conventional Windows program. In this
class, it is the virtual function OnDraw(..) that gets called when a
WM_PAINT message is received by the window. In the MFC, functions
that begin with On... are designated as message handler functions.
In Visual C++, the wizard that helps you build the program will also
helpfully add the code for most message handlers.

¹In object-orientated programming, the functions in a class are also known as its methods.
We will use the terms function and method interchangeably.

Here is the code for this one:
void CFwView::OnDraw(CDC* pDC){
// TO-DO: add draw code for drawing in the client area here
}
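For instance, one might fill in the handler like this (a hypothetical illustration, not part of the generated template):

    void CFwView::OnDraw(CDC* pDC){
      // draw something in the client area using the MFC device-context class
      pDC->TextOut(10, 10, "Hello from the view class");
    }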
The message map in fw.cpp shows macro functions which set up the link
between window messages (including menu or toolbar commands) and
the C++ code that the programmer writes to handle the message. So,
for example, in our framework application we want to respond to commands
from the menu that display an about dialog, create a new document or
open an existing one. To achieve this, the MESSAGE_MAP would be
augmented with another macro making the link between the menu
command identifier and the handler function as follows:
BEGIN_MESSAGE_MAP(CFwApp, CWinApp)
//{{AFX_MSG_MAP(CFwApp)
ON_COMMAND(ID_APP_ABOUT, OnAppAbout)
//}}AFX_MSG_MAP
// Standard file based document commands
ON_COMMAND(ID_FILE_NEW, CWinApp::OnFileNew)
ON_COMMAND(ID_FILE_OPEN, CWinApp::OnFileOpen)
END_MESSAGE_MAP()
The rather bizarre-looking comments such as //}}AFX_MSG_MAP are inserted in the code by the helper wizard in Microsoft Visual C++ to
allow it to add, and later remove, message mappings and other features
without the programmer having to do it all herself. So, from the message map, one can say that if the program's user selects the File->Open
command from the menu, a WM_COMMAND message will be generated, its
wParam decoded to a value of ID_FILE_OPEN and the CWinApp class
function OnFileOpen() called. (All of this happens in the background, deep inside the
code of the class library, and therefore it does not clutter up the application with pages of repeated code or a hugely bloated switch/case
construct.)
2. DoDataExchange. One of the most tedious jobs in building a large and
richly featured Windows program in C/C++ is writing all the code to
manage the user interface through dialog boxes and dialog box controls,
getting information back and forward from edit controls, handling selection in drop-down and combo boxes etc., and then verifying it all.
This is one area where the MFC offers a major advantage because it
builds into the classes associated with dialog controls a mechanism to
automatically transfer and validate the control values, copying back and
forward and converting from the text strings in text boxes etc. to/from
the program's variables. This mechanism is called DoDataExchange.
Our template code does not use this, but if you examine the template
code at the bottom of the fw.cpp file (on the CD) where the about
dialog class is declared, you will see the following lines of code:
//{{AFX_VIRTUAL(CAboutDlg)
protected:
virtual void DoDataExchange(CDataExchange* pDX);
//}}AFX_VIRTUAL
// DDX/DDV support
The effect of this is to specify a function that the mechanism for automatically exchanging and validating data in dialog boxes would use
if there were any data to exchange. Just to illustrate the point: had
the about dialog contained an edit control (for entering text) and the
CAboutDlg class contained a CString member to hold the text, then
including the following code for the DoDataExchange function:
will result in the text contents of the edit control being copied to the
m_EditText member variable of the CAboutDlg class when the dialog
is dismissed by the user.
To sum up: the complete MFC framework (see Listings E.1 and E.2)
presents the key features of the application class, Listings E.3 and E.4 show
the document class and Listings E.5 and E.6 show the view class.
#include "stdafx.h"
#include "Fw.h"
#include "MainFrm.h"
#include "FwDoc.h"
#include "FwView.h"
CFwApp theApp;
BEGIN_MESSAGE_MAP(CFwApp, CWinApp)
// message map for application
//{{AFX_MSG_MAP(CFwApp)
ON_COMMAND(ID_APP_ABOUT, OnAppAbout)
// NOTE - the ClassWizard will add and remove mapping macros here.
//
DO NOT EDIT what you see in these blocks of generated code!
//}}AFX_MSG_MAP
ON_COMMAND(ID_FILE_NEW, CWinApp::OnFileNew) // handle the NEW command here
ON_COMMAND(ID_FILE_OPEN, CWinApp::OnFileOpen)// handle Open command here
END_MESSAGE_MAP()
CFwApp::CFwApp(){ }
BOOL CFwApp::InitInstance(){
CSingleDocTemplate* pDocTemplate;
pDocTemplate = new CSingleDocTemplate(
IDR_MAINFRAME,
RUNTIME_CLASS(CFwDoc),
RUNTIME_CLASS(CMainFrame),
RUNTIME_CLASS(CFwView));
AddDocTemplate(pDocTemplate);
CCommandLineInfo cmdInfo;
ParseCommandLine(cmdInfo);
m_pMainWnd->ShowWindow(SW_SHOW);
m_pMainWnd->UpdateWindow();
return TRUE;
}
#include "stdafx.h"
#include "Fw.h"
#include "FwDoc.h"
// applicatoin header
// this class header
CFwDoc::CFwDoc(){}
CFwDoc::~CFwDoc(){}
BOOL CFwDoc::OnNewDocument(){
// called in response to "New" on File menu
if (!CDocument::OnNewDocument())return FALSE;
..// put any code here to be executed when new document is created
return TRUE;
}
void CFwDoc::Serialize(CArchive& ar){
// This function is used to read
if (ar.IsStoring()){ .. }
// and write documents to/from disk.
else { ... }
}
////////////// other message handlers appear here in the code /////////////////
// this function matches the MESSAGE_MAP
void CFwDoc::OnFileCommand1(){
..// put code to implement the command here
}
//{{AFX_VIRTUAL(CFwView)
public:
virtual void OnDraw(CDC* pDC); // overridden to draw this view
virtual BOOL PreCreateWindow(CREATESTRUCT& cs);
//}}AFX_VIRTUAL
virtual ~CFwView();
protected:
//{{AFX_MSG(CFwView)
// Generated message map functions
// NOTE - the ClassWizard will add and remove member functions here.
//
DO NOT EDIT what you see in these blocks of generated code !
//}}AFX_MSG
DECLARE_MESSAGE_MAP()
// this class also handles some messages
public:
afx_msg void OnFileCommand2();
// a handler for a command on the menu
};
#include "stdafx.h"   // standard headers
#include "Fw.h"       // application header
#include "FwDoc.h"
#include "FwView.h"   // this class header
Bibliography
[1] S. Stanfield. Visual C++ 6 How-To. Indianapolis, IN: Sams, 1997.
[2] J. Swanke. Visual C++ MFC Programming by Example. New York: McGraw-Hill, 1998.
The COM in Practice
F.1 Using COM
As we said in Chapter 12, COM is architecture-neutral. It is a binary standard with a predefined layout for the interfaces a COM object exposes to the
outside world; the objects themselves never appear externally. Most COM
CoClasses and their interfaces are implemented in C++. The CoClass is just
a class (because the COM object itself is never exposed, it can be described
in a language-specific way) and a C++ abstract class is perfect for the job. Interfaces are tables of pointers to functions implemented by the COM object.
The table represents the interface, and the functions to which it points are the
methods of that interface.
The binary standard for an interface table is exactly analogous to C++
virtual function tables (known as vtables), and therefore if you are building a
COM object, this is the format in which you would define your interfaces.
Figure F.1 illustrates the relationship between the interface pointer, the vtable
pointers and the interfaces methods.
From the point of view of using a COM interface in a program, provided
one uses C++ it couldn't be simpler. Since a C++ pointer to an instance of
an object is always passed to the class's methods as an implicit parameter (the
this parameter) when any of its methods are called, one can use the pointer
to the interface in the same way one would use any pointer to a function to
call that function. In C, the this argument and the indirection through the
vtable must be included explicitly.
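For example, calling a method through an interface pointer lpDD (assumed here to be a DirectDraw interface pointer that has already been obtained, e.g., with DirectDrawCreate()) looks quite different in the two languages; the fragment below is an illustrative sketch, not a listing from the projects:

    /* A C sketch of a COM method call made through the vtable. */
    #include <ddraw.h>

    void ReleaseInterface_C(LPDIRECTDRAW lpDD)
    {
        /* In C++ this would simply be:  lpDD->Release();                   */
        lpDD->lpVtbl->Release(lpDD);  /* explicit vtable and 'this' pointer */
    }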
Note the explicit use of the lpVtbl pointer and the repetition of the lpDD
pointer (which is the implied this pointer in C++). If one were writing any
COM program in C, this tedious process would have to be done every time
a COM interface method were used, or at least a macro would have to be
written to achieve the same thing.
F.2 COM Recipes
With the basics of COM fresh in our minds, this short section will explain
a few basic recipes that over the next few chapters are going to be deployed
to get the applications initialized. Taken together with the frameworks from
Appendices D and E, they will help us get our example programs ready for
business without having to repeatedly go over what is going on every time
we meet COM initialization. Therefore, what we need here are a few lines
of explanation of how COM is set up and configured for using Direct3D,
DirectInput and DirectShow.
With the introduction of DirectX 9, getting a 3D rendering application started is a lot simpler than it used to be in the earlier versions.
The behavior and operation of Direct3D is controlled by two important interfaces: the Direct3D system object interface and the Direct3D
device interface. It is necessary to get a pointer to the interface exposed
by the Direct3D system object and use it to create an object to manage the display. The display manager object exposes an interface (IDirect3DDevice) and its methods control the whole rendering process.
Note in the code that the Direct3DCreate(..) function is called in
a conventional (functional) manner. This is not the usual way to obtain
a pointer to a COM interface.
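In outline, and with error checking omitted, the initialization might look like this (a sketch; the global pointer names match those used in the release fragment below, and the window handle hWnd is assumed to exist):

    #include <windows.h>
    #include <d3d9.h>

    LPDIRECT3D9       g_pD3D       = NULL;  // the Direct3D system object interface
    LPDIRECT3DDEVICE9 g_pd3dDevice = NULL;  // the display (device) interface

    void InitD3D(HWND hWnd)
    {
        D3DPRESENT_PARAMETERS d3dpp;
        // get the Direct3D system object - note the conventional function call
        g_pD3D = Direct3DCreate9(D3D_SDK_VERSION);
        ZeroMemory(&d3dpp, sizeof(d3dpp));
        d3dpp.Windowed         = TRUE;
        d3dpp.SwapEffect       = D3DSWAPEFFECT_DISCARD;
        d3dpp.BackBufferFormat = D3DFMT_UNKNOWN;
        // use the system object to create the device that manages the display
        g_pD3D->CreateDevice(D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL, hWnd,
                             D3DCREATE_SOFTWARE_VERTEXPROCESSING,
                             &d3dpp, &g_pd3dDevice);
    }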
Before the application exits, we must ensure that the interfaces are released so that the COM system can free the memory associated with
the Direct3D9 objects if no one else is using them.
if( g_pd3dDevice != NULL )g_pd3dDevice->Release();
if( g_pD3D != NULL )g_pD3D->Release();
    // pointer to the interface
    IGraphBuilder *pGB;
    ...
    // initialize the COM mechanism
    CoInitializeEx(NULL, COINIT_APARTMENTTHREADED);
    // Create an instance of the CoClass of the DirectShow FilterGraph object
    // and return (<- here) a pointer to its GraphBuilder interface.
    CoCreateInstance(CLSID_FilterGraph, NULL, CLSCTX_INPROC_SERVER,
                     IID_IGraphBuilder, (void **)&pGB);
    ...
    // clean up: release the graph builder interface
    pGB->Release();
    // close the COM mechanism
    CoUninitialize();
You may remember we discussed using ATL (the active template library) to help us avoid the need to remember to release interfaces when
we are finished with them. When using the ATL, our DirectShow initialization code would be written as:
    CoInitializeEx(NULL, COINIT_APARTMENTTHREADED);
    CComPtr<IGraphBuilder> pGB;    // GraphBuilder
    pGB.CoCreateInstance(CLSID_FilterGraph, NULL, CLSCTX_INPROC);
    ...
    CoUninitialize();
    LPDIRECTINPUT8       g_pDI        = NULL;
    LPDIRECTINPUTDEVICE8 g_pJoystick1 = NULL;

    hr = DirectInput8Create(g_hInst, DIRECTINPUT_VERSION,
                            IID_IDirectInput8, (VOID**)&g_pDI, NULL);
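Once the DirectInput object exists, a device interface for each joystick has to be created, given a data format and acquired. The fragment below continues the snippet above as a sketch only; it assumes the joystick's GUID has already been found with EnumDevices() (not shown) and that hWnd is the application's window handle.

    // g_GuidJoystick1 is assumed to have been filled in by a callback passed
    // to g_pDI->EnumDevices(DI8DEVCLASS_GAMECTRL, ...), which is not shown here.
    hr = g_pDI->CreateDevice(g_GuidJoystick1, &g_pJoystick1, NULL);
    if (SUCCEEDED(hr)) {
        // tell DirectInput how we want the joystick state reported ...
        g_pJoystick1->SetDataFormat(&c_dfDIJoystick2);
        // ... and that we only need it while our window has the focus
        g_pJoystick1->SetCooperativeLevel(hWnd, DISCL_EXCLUSIVE | DISCL_FOREGROUND);
        // take control of the device so that its state can be read each frame
        g_pJoystick1->Acquire();
    }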